Welcome to this non-exhaustive list and description of the basics of programming in R, and of some of the most useful commands to work with datasets of the kind you would get in a typical psychology experiment (e.g. responses, reaction times, etc.).
For me it is useful as a place where I can quickly look things up.
For you it might serve the same purpose. If instead you are new to R or to programming in general, this doc (and the links in it) can be a good place to get started.
Be aware that although this doc is primarily about R, you will sometimes find references to and comparisons with the Matlab language (another program I and psychologists/neuroscientists in general like to use).
Good luck!!
Sebastian Korb, Ph.D.
Faculty of Psychology, University of Vienna, Austria
My webpage
For comments please write to sebkorb1[at]gmail.com
R is completely free, and there is a large community using it. World’s best statisticians and programmers constantly improve old functions and create and distribute new R functions and packages. If you have a question about how to do anything in R, you most likely will find an answer to it through a Google search.
Using RStudio (an editor that interfaces with and helps with R) and its R Markdown functionality, you can combine R code, analyses, plots, and written text into one nice-looking document. This allows you to easily share all your data manipulation and analysis steps, making them transparent, shareable, and reproducible.
Comparing R and SPSS, Greg Snow wrote the following on Stack Overflow:
Buses are very easy to use, you just need to know which bus to get on, where to get on, and where to get off (and you need to pay your fare). Cars on the other hand require much more work, you need to have some type of map or directions (even if the map is in your head), you need to put gas in every now and then, you need to know the rules of the road (have some type of driver's licence). The big advantage of the car is that it can take you a bunch of places that the bus does not go and it is quicker for some trips that would require transferring between buses.
Using this analogy programs like SPSS are busses, easy to use for the standard things, but very frustrating if you want to do something that is not already preprogrammed.
R is a 4-wheel drive SUV (though environmentally friendly) with a bike on the back, a kayak on top, good walking and running shoes in the passenger seat, and mountain climbing and spelunking gear in the back.
R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning where the bus stops are in SPSS.
See examples in the R-graph gallery,
and a description of the wordcloud2 package here.
Several word clouds were included in this review article:
To create your own word clouds, you first need to install the wordcloud2 package.
library(wordcloud2)
head(demoFreq) # some word frequency count
## word freq
## oil oil 85
## said said 73
## prices prices 48
## opec opec 42
## mln mln 31
## the the 26
# Now let's make the corresponding word cloud. Notice how frequent words like 'oil' appear larger compared to less
# frequent words like 'the'
wordcloud2(data = demoFreq)
Use the package Plotly to create interactive graphs,
which you can also embed in your website.
This allows you to better explore, understand, and explain your data.
install.packages("plotly")
Let’s make interactive boxplots with the dataset “midwest” from the package “ggplot2”.
It contains demographic information of midwest counties across 5 states (IL, IN, MI, OH, WI).
We’ll focus on the percentage of people with a college education.
library(plotly)
data(midwest)
p <- plot_ly(midwest, x = ~percollege, color = ~state, type = "box")
p
Moving the cursor over the graph allows you to see the values corresponding to the quartiles and the outliers.
For example, you can see that in Wisconsin,
Dane county (where the capital Madison lies) has 43.6% of college-educated people.
as.character(midwest[midwest$state == "WI" & midwest$percollege >= 43, "county"])
## [1] "DANE"
To install a package, type install.packages("ggplot2") in the Console; to use it, add library(ggplot2) to (the beginning of) your script. Alternatively, use the "Packages" tab in the bottom-right corner of the RStudio interface. * There is also the possibility to use a function from a package without loading it, or better by loading it temporarily. For example, to use the
chordDiagram function from the package circlize:
circlize::chordDiagram(matrix(sample(10), nrow = 4, ncol = 5))
Packages are sets of functions that some smart R-wizard wrote and made available for everybody.
Most packages can be downloaded from the Comprehensive R Archive Network (CRAN). But you don't actually need to go to that website; you can install them from within R or RStudio.
For that, hit the "Packages" tab (probably on the lower right of your RStudio), then "Install", and enter the name of the desired package. RStudio should download it automatically. Alternatively, use the function install.packages("NAME.OF.YOUR.PACKAGE").
Now the package has been installed and will be included in the list of packages in RStudio. However, you still need to load it every time you want to use it. For that, either click the box next to the package name in the "Packages" tab of RStudio, or use this command: library(NAME.OF.YOUR.PACKAGE) (no quotation marks needed this time).
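The install-then-load pattern can be sketched like this (using ggplot2 as an example; note where the quotation marks are and are not needed):

```r
# install once per computer (quotation marks needed)
install.packages("ggplot2")
# load in every session/script where you want to use it (no quotation marks needed)
library(ggplot2)
```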
If you want to see a list of all the packages that you have currently loaded, type search(), and to see where on your computer these packages are stored, type searchpaths().
Some of the packages I use most are: ggplot2 for graphs, lme4 for linear mixed models, but also car and Hmisc.
If you are not using the latest R version (e.g. because your old OS does not allow it),
it might be necessary to install older versions of certain packages.
THIS link explains how to do it.
require(devtools)
install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")
Sometimes a package might not install or run because your R version is too old. To find out which version you are using, type version in the Console and hit Enter
version
## _
## platform x86_64-w64-mingw32
## arch x86_64
## os mingw32
## system x86_64, mingw32
## status
## major 3
## minor 5.1
## year 2018
## month 07
## day 02
## svn rev 74947
## language R
## version.string R version 3.5.1 (2018-07-02)
## nickname Feather Spray
To get help on a function, type ? followed by the name of the function, e.g. ?mean or (with a space) ? mean, or get to the start of the online help with help.start().
I have used R Markdown to create this file.
In a nutshell, R Markdown allows you to write a neat looking .html or .pdf file containing both text and R code (or code in other languages like Bash or Python).
The code can be executed and you can have the results show up in your .html file.
I highly recommend checking it out and using it for teaching (you can also create presentations, similar to what you would do with Powerpoint) or to create a data analysis report that can be shared with your coworkers.
There are several places on the internet where you can learn R Markdown. Here is one useful resource, and here is another one.
"<-" and then whatever you want to put into it. It does also work if you replace the "<-" with "=", but R users will almost certainly look down on you.<-), although you could also use an equal sign (=)# for example, if I wanted to put the year 2018 into an object (also called variable) with the name thisyear, I'd write
thisyear <- 2018
# this automatically creates it and one way to check is by typing just the name
thisyear # R is case sensitive, so be careful with the big and small caps!
## [1] 2018
Everything you can do in the Console you can also do in a Script
Using scripts allows you to save your commands, and reuse them on later occasions.
This also allows you to save time by adapting parts of old scripts to analyze new data.
To open a new script, go to File > New File > R Script, or use the shortcut (on a Mac it is: cmd + Shift + N)
To execute the entire script, hit the RUN button
To execute 1 line, put the cursor somewhere in that line and hit ctrl + Enter (Windows) or cmd + Enter (Mac)
To execute more than 1 line, select them (holding down the left mouse button) and then hit ctrl/cmd + Enter
By default R goes to the directory in which R is saved on your computer.
Verify with:
getwd() # find out which folder on your computer R is looking at
## [1] "X:/PIs/Seba/Teaching/R/R_important_commands_list_Seba"
Type setwd("/DISK/FOLDER/") to change the working directory, or use the mouse to go to Session > Set Working Directory
An object is anything that you create and that can contain numbers, text, plots, or statistical models. Per convention, objects are created by writing the name of the object followed by <- and then whatever you want to put into it. It also works if you replace the <- with = (as is done in Matlab), but R users will almost certainly look down on you.
# for example, if I wanted to put the year 2017 into an object (also called variable) with the name thisyear, I'd write
thisyear <- 2017
# this automatically creates it and one way to check is by typing just the name
thisyear # R is case sensitive, so be careful with the big and small caps!
## [1] 2017
When I start a new R script I typically first delete everything in the Workspace (in RStudio it is called Environment). For that I use:
rm(list = ls())
If instead you want to delete only a specific object, say the variable a :
a <- 1 # let's create the variable with name 'a' and fill it with the integer 1
rm(a)
# or remove several variables at once (without emptying the whole Workspace)
rm(a, b, c) # note: R warns about objects that don't exist (here b and c), but removes the rest
Of course you could do this simply by using the Environment tab, but here is how you do it by writing in the Console:
a <- 1 # let's create variable a again
ls()
## [1] "a"
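A related helper is exists(), which checks whether an object with a given name is currently in the Workspace:

```r
a <- 1       # create the variable 'a'
exists("a")  # TRUE: 'a' is in the Workspace
rm(a)        # remove it again
exists("a")  # FALSE
```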
# create a new folder on your computer with dir.create()
Myfolder <- "/Users/seba/TESTFOLDER/"
dir.create(Myfolder)
To list all the files in a specific directory, you can use the function list.files()
list.files(Myfolder) # if you don't write anything it lists what is in the working directory
You can also look for files with a specific extension
# To only list files containing the .html extension
Sourcedir <- Myfolder
list.files(path = Sourcedir, pattern = "\\.html$") # escape the dot and anchor with $ to match the extension exactly
Or you might want to be even more specific, and indicate an extension and some text in the middle of the name
# To select .html files that also contain 'course' in the name, create 'wildcards' with `.*`
Sourcedir <- Myfolder
list.files(path = Sourcedir, pattern = ".*course.*\\.html")
You get the same result using the function Sys.glob(), which allows you to do wildcard expansion (also known as "globbing") on file paths.
Sys.glob("*course*.html")
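If you need the complete paths rather than just the file names (e.g. to read the files in afterwards), add full.names = TRUE (the folder name below is of course just an example):

```r
# full.names = TRUE returns complete paths instead of bare file names
Sourcedir <- "/Users/seba/TESTFOLDER/"
list.files(path = Sourcedir, pattern = "\\.html$", full.names = TRUE)
```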
There are several types of data R can handle (e.g. see here for a list).
The data types you will be working with most of the time are these:
# create a numeric vector. use c() to combine values.
a <- c(0, NA, 3.2, -15, sqrt(4)) # NA means not available; sqrt() computes the square root
# create a character vector
b <- c("Hello", "World", "how", "are", "you") # the quotes (single or double both work) indicate a text
# create a logical vector
c <- c(TRUE, FALSE, TRUE, FALSE, FALSE) # must all be caps!
# create a data frame with the vectors a, b,and c that we just created
my_first_dataframe <- data.frame(a, b, c)
# we could also change the column names (currently they are a, b, c)
colnames(my_first_dataframe) <- c("some_numbers", "my_sentence", "logic_test")
# now let's have a look at it
my_first_dataframe
## some_numbers my_sentence logic_test
## 1 0.0 Hello TRUE
## 2 NA World FALSE
## 3 3.2 how TRUE
## 4 -15.0 are FALSE
## 5 2.0 you FALSE
You might have noticed that I put underscores ( _ ) instead of spaces in the column names. The reason is that this makes working with them easier, trust me.
# create a list of 5 elements
my_first_list <- list(aName = "whatever", aNum = 3.7, NumericVector = a, CharacterVector = b, Matrix = my_first_dataframe)
# the 2nd column in the just created data frame "my_first_dataframe" is a factor
# R has decided so by itself, since that column does not contain numbers or logical data
# One way to find out is this:
str(my_first_dataframe)
## 'data.frame': 5 obs. of 3 variables:
## $ some_numbers: num 0 NA 3.2 -15 2
## $ my_sentence : Factor w/ 5 levels "are","Hello",..: 2 4 3 1 5
## $ logic_test : logi TRUE FALSE TRUE FALSE FALSE
# compare to the list "my_first_list"
str(my_first_list)
## List of 5
## $ aName : chr "whatever"
## $ aNum : num 3.7
## $ NumericVector : num [1:5] 0 NA 3.2 -15 2
## $ CharacterVector: chr [1:5] "Hello" "World" "how" "are" ...
## $ Matrix :'data.frame': 5 obs. of 3 variables:
## ..$ some_numbers: num [1:5] 0 NA 3.2 -15 2
## ..$ my_sentence : Factor w/ 5 levels "are","Hello",..: 2 4 3 1 5
## ..$ logic_test : logi [1:5] TRUE FALSE TRUE FALSE FALSE
Change a factor to numeric or character: it is possible to change a factor into a numeric vector (if it contains numbers), or into a vector filled with characters (if it contains text).
# we add a new column to my_first_dataframe, call it 'my_sentence_char', and fill it with a character version of 'my
# sentence' (i.e. we transform from factor to character)
my_first_dataframe$my_sentence_char <- as.character(my_first_dataframe$my_sentence)
# then we check again
str(my_first_dataframe)
## 'data.frame': 5 obs. of 4 variables:
## $ some_numbers : num 0 NA 3.2 -15 2
## $ my_sentence : Factor w/ 5 levels "are","Hello",..: 2 4 3 1 5
## $ logic_test : logi TRUE FALSE TRUE FALSE FALSE
## $ my_sentence_char: chr "Hello" "World" "how" "are" ...
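A word of caution when going from factor to numeric: calling as.numeric() directly on a factor returns its internal level codes, not the values. Convert to character first. A minimal sketch:

```r
f <- factor(c("10", "20", "30"))
as.numeric(f)                # 1 2 3 -- the level codes, NOT the numbers!
as.numeric(as.character(f))  # 10 20 30 -- the actual numbers
```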
To read in your data, use read.delim() (this is perfect for reading delimited files, defaulting to the TAB character for the delimiter),
but see also read.table(), read.csv(), and read.spss() (the latter from the foreign package)
Here is a useful tutorial by DataCamp on how to import different types of data into R
First you need to tell R where to find the data. You can do this by setting the working directory to the right folder using setwd()
But I actually have the habit of first defining a working directory in my script, and then concatenating its name with the name of the file I want to read in.
SourceDir <- "/Users/seba/TESTFOLDER/"
Mydata <- read.delim(paste(SourceDir, "MyFile.txt", sep = ""), header = TRUE)
# the function *paste* concatenates objects, in this case SourceDir and 'MyFile.txt'; with sep = '' I asked it to do so
# without adding anything in between
# if your data is a tab-delimited .txt file, you can indicate the separator
Mydata <- read.table(paste(SourceDir, "MyFile.txt", sep = ""), header = TRUE, sep = "\t")
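An alternative to paste() for building paths is file.path(), which inserts the separator for you (the folder name here is just an example):

```r
SourceDir <- "/Users/seba/TESTFOLDER"  # note: no trailing slash needed
file.path(SourceDir, "MyFile.txt")     # "/Users/seba/TESTFOLDER/MyFile.txt"
# then read the file as before:
# Mydata <- read.delim(file.path(SourceDir, "MyFile.txt"), header = TRUE)
```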
R comes with some data of its own (in the package datasets), and the same is true for many additional packages. On this web page some of the more common datasets are described. To see all the data your R has built in, type data(). To see the data in a specific package, type try(data(package = "datasets")).
In the package “datasets” you’ll find the data mtcars, which is one of the better known built-in datasets. Load it with data(mtcars)
data(mtcars) # load into R the built-in dataset called 'mtcars'
# this creates the object mtcars in the Environment
To get some information about this dataset type help(mtcars) or ?mtcars
You can also look at the actual data, either by typing mtcars in the Console (this will write all the data), or by double clicking with the left mouse button on mtcars in the Environment tab. This will open a new window, which will look very similar to an Excel Worksheet. The same can be obtained by typing View(mtcars) in the Console.
You can save a data frame as a text file with
write.table(mtcars, "mydata.txt", sep="\t") (this can be opened by every program, and can be read in Excel). If you do not indicate a path, it will be saved in the working directory (remember that you can use getwd() to check, and setwd() to change the working directory)
I personally prefer defining a “Savedir” folder at the beginning of my script, and then use that when saving
Savedir <- "/Users/seba/my_R"
dir.create(Savedir) # if you need to create the folder (it will not be overwritten if it exists)
write.table(mtcars, paste(Savedir, "/this_is_my_dataframe.txt", sep = ""), sep = "\t")
# this works almost fine. But better to exclude the row names using row.names=FALSE:
write.table(mtcars, paste(Savedir, "/this_is_my_dataframe.txt", sep = ""), sep = "\t", row.names = FALSE)
You can also save directly to an .xlsx spreadsheet using write.xlsx()
library(xlsx)
write.xlsx(mtcars, paste(Savedir, "/this_is_my_dataframe.xlsx", sep = ""), row.names = FALSE)
Save to SPSS using write.foreign() (but honestly this didn’t work for me, so I recommend importing the .txt file in SPSS instead)
library(foreign)
rownames(mtcars) <- c()
write.foreign(mtcars, paste(Savedir, "/this_is_my_dataframe.txt", sep = ""), paste(Savedir, "/this_is_my_dataframe.sps",
sep = ""), package = "SPSS")
class(mtcars) # the mtcars object is a data.frame, something similar to an Excel spreadsheet
## [1] "data.frame"
dim(mtcars) # get dimensions (rows, columns)
## [1] 32 11
nrow(mtcars) # get the number of rows (specific for data.frames)
## [1] 32
ncol(mtcars) # get the number of columns (specific for data.frames)
## [1] 11
names(mtcars) # get the variable names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
library(car)
some(mtcars) # randomly selects 10 rows (part of the car package)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
head(mtcars) # shows the first 6 rows
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars) # shows the last 6 rows
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
summary(mtcars) # summarizes data per column (min, median, mean, max, etc.)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
library(Hmisc) # load the Hmisc package (but the describe function also exists in the psych package)
describe(mtcars)
## mtcars
##
## 11 Variables 32 Observations
## ---------------------------------------------------------------------------
## mpg
## n missing distinct Info Mean Gmd .05 .10
## 32 0 25 0.999 20.09 6.796 12.00 14.34
## .25 .50 .75 .90 .95
## 15.43 19.20 22.80 30.09 31.30
##
## lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
## ---------------------------------------------------------------------------
## cyl
## n missing distinct Info Mean Gmd
## 32 0 3 0.866 6.188 1.948
##
## Value 4 6 8
## Frequency 11 7 14
## Proportion 0.344 0.219 0.438
## ---------------------------------------------------------------------------
## disp
## n missing distinct Info Mean Gmd .05 .10
## 32 0 27 0.999 230.7 142.5 77.35 80.61
## .25 .50 .75 .90 .95
## 120.83 196.30 326.00 396.00 449.00
##
## lowest : 71.1 75.7 78.7 79.0 95.1, highest: 360.0 400.0 440.0 460.0 472.0
## ---------------------------------------------------------------------------
## hp
## n missing distinct Info Mean Gmd .05 .10
## 32 0 22 0.997 146.7 77.04 63.65 66.00
## .25 .50 .75 .90 .95
## 96.50 123.00 180.00 243.50 253.55
##
## lowest : 52 62 65 66 91, highest: 215 230 245 264 335
## ---------------------------------------------------------------------------
## drat
## n missing distinct Info Mean Gmd .05 .10
## 32 0 22 0.997 3.597 0.6099 2.853 3.007
## .25 .50 .75 .90 .95
## 3.080 3.695 3.920 4.209 4.314
##
## lowest : 2.76 2.93 3.00 3.07 3.08, highest: 4.08 4.11 4.22 4.43 4.93
## ---------------------------------------------------------------------------
## wt
## n missing distinct Info Mean Gmd .05 .10
## 32 0 29 0.999 3.217 1.089 1.736 1.956
## .25 .50 .75 .90 .95
## 2.581 3.325 3.610 4.048 5.293
##
## lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424
## ---------------------------------------------------------------------------
## qsec
## n missing distinct Info Mean Gmd .05 .10
## 32 0 30 1 17.85 2.009 15.05 15.53
## .25 .50 .75 .90 .95
## 16.89 17.71 18.90 19.99 20.10
##
## lowest : 14.50 14.60 15.41 15.50 15.84, highest: 19.90 20.00 20.01 20.22 22.90
## ---------------------------------------------------------------------------
## vs
## n missing distinct Info Sum Mean Gmd
## 32 0 2 0.739 14 0.4375 0.5081
##
## ---------------------------------------------------------------------------
## am
## n missing distinct Info Sum Mean Gmd
## 32 0 2 0.724 13 0.4062 0.498
##
## ---------------------------------------------------------------------------
## gear
## n missing distinct Info Mean Gmd
## 32 0 3 0.841 3.688 0.7863
##
## Value 3 4 5
## Frequency 15 12 5
## Proportion 0.469 0.375 0.156
## ---------------------------------------------------------------------------
## carb
## n missing distinct Info Mean Gmd
## 32 0 6 0.929 2.812 1.718
##
## Value 1 2 3 4 6 8
## Frequency 7 10 3 10 1 1
## Proportion 0.219 0.312 0.094 0.312 0.031 0.031
## ---------------------------------------------------------------------------
str(mtcars) # this is what you would get by clicking the >Arrow button next to the mtcars object in the Environment
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Verify distribution, and check if variables follow a Normal distribution
A common assumption necessary for many statistical analyses is that the data is approximately normally distributed.
There are several ways of testing this.
For visual inspection, use a histogram and the qqPlot (quantile-quantile plot).
For the qqplot we sort our data, plot it against a specially-designed x-axis based on our reference distribution (e.g., the Gaussian “bell curve”), and look to see whether the points lie approximately on a straight line.
If the data is normally distributed, the points will align on the diagonal of the qqplot
The variable “weight” in the “ChickWeight” dataset clearly does NOT follow a normal distribution
# load the 'car'' package to use the function qqPlot()
library(car)
par(mfrow = c(1, 2))
hist(ChickWeight$weight)
qqPlot(ChickWeight$weight)
## [1] 400 399
For a numerical test of normality, use the Shapiro-Wilk test. If the test is significant (p < .05), then the data is NOT normally distributed.
But keep in mind that tiny variations from normality can lead to a significant Shapiro-Wilk test if the sample size is large (or the dataset contains many trials). Also, the sample size must be between 3 and 5000. Otherwise the test won’t work!
Thus do not blindly rely on this test, and always look at the histogram and qqplot too.
shapiro.test(ChickWeight$weight)
##
## Shapiro-Wilk normality test
##
## data: ChickWeight$weight
## W = 0.90866, p-value < 2.2e-16
Skewness, kurtosis, and the Shapiro-Wilk test are also obtained if using the stat.desc() function from the pastecs package.
Set norm = TRUE to include the Shapiro-Wilk test
library(pastecs)
stat.desc(ChickWeight$weight, basic = FALSE, norm = TRUE)
## median mean SE.mean CI.mean.0.95 var
## 1.030000e+02 1.218183e+02 2.956204e+00 5.806232e+00 5.051223e+03
## std.dev coef.var skewness skew.2SE kurtosis
## 7.107196e+01 5.834258e-01 9.587845e-01 4.717394e+00 3.372147e-01
## kurt.2SE normtest.W normtest.p
## 8.309955e-01 9.086615e-01 3.873675e-18
Another assumption for parametric statistics is the homogeneity of variances (data from multiple groups have the same variance).
It can be tested with Levene’s test using the function leveneTest() from the car package (centering on the median by default).
If the Levene test is significant, it means that the assumption of homogeneity of variances has been violated.
library(car)
leveneTest(ChickWeight$weight, ChickWeight$Diet, center = median)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 9.6001 3.418e-06 ***
## 574
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A straightforward way to index elements in a vector or matrix (data frame) is by giving the element number inside square brackets. If you want only 1 element of a vector, then 1 number in brackets is enough.
# For example, let's create a vector called 'whatever' with 6 numbers:
whatever <- c(1, 2, 3, 4, 5, 6)
whatever
## [1] 1 2 3 4 5 6
# btw, you get the same by writing it like this
whatever <- c(1:6) # the colon tells R to include the numbers 1 and 6, and create all the numbers increasing by 1 in between
whatever # let's check again
## [1] 1 2 3 4 5 6
Btw, in Matlab (another program that psychologists and neuroscientists like to use) you could easily create a vector in which numbers increase by 2 with 1:2:6, but unfortunately this syntax does not work in R (you can do it with the function seq() though)
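A quick sketch of seq(), which covers what Matlab's start:step:end does (and more):

```r
seq(1, 6, by = 2)          # 1 3 5 -- like Matlab's 1:2:6
seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00 -- fixed number of steps
```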
Now let's extract the 3rd number in the vector "whatever".
whatever[3] # Can you guess what it will be?
## [1] 3
And the 2nd to 5th element of “whatever”:
whatever[2:5]
## [1] 2 3 4 5
In case you were wondering, the numbers in [square brackets] on the left of the output indicate the index of the first element shown on each line, which helps you see how many elements there are
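You can also index by exclusion, using negative numbers (note this is different from Matlab, where negative indices are an error):

```r
whatever <- 1:6
whatever[-1]         # everything except the 1st element: 2 3 4 5 6
whatever[c(-1, -6)]  # drop the 1st and the 6th element: 2 3 4 5
```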
With dataframes it works slightly differently: you need to provide 2 indices, i.e. rows and columns (in this order!), separated by a comma.
Indexing of a Matlab matrix works similarly, with the difference that in Matlab you could also do linear indexing, which means going down the columns one by one.
mtcars[, 3] # all the elements (rows) of the 3rd column in the dataframe mtcars
## [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6
## [12] 275.8 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0
## [23] 304.0 350.0 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
mtcars[3, ] # all the elements (columns) of the 3rd row in the dataframe mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
mtcars[1:5, c(3, 4)] # The first 5 rows of columns 3 and 4 in the dataframe mtcars
## disp hp
## Mazda RX4 160 110
## Mazda RX4 Wag 160 110
## Datsun 710 108 93
## Hornet 4 Drive 258 110
## Hornet Sportabout 360 175
# this gives the last 17 rows and the last 6 columns of the dataframe mtcars, based on the total number of rows and
# columns
mtcars[(nrow(mtcars)/2):nrow(mtcars), ceiling((ncol(mtcars)/2)):ncol(mtcars)]
## wt qsec vs am gear carb
## Lincoln Continental 5.424 17.82 0 0 3 4
## Chrysler Imperial 5.345 17.42 0 0 3 4
## Fiat 128 2.200 19.47 1 1 4 1
## Honda Civic 1.615 18.52 1 1 4 2
## Toyota Corolla 1.835 19.90 1 1 4 1
## Toyota Corona 2.465 20.01 1 0 3 1
## Dodge Challenger 3.520 16.87 0 0 3 2
## AMC Javelin 3.435 17.30 0 0 3 2
## Camaro Z28 3.840 15.41 0 0 3 4
## Pontiac Firebird 3.845 17.05 0 0 3 2
## Fiat X1-9 1.935 18.90 1 1 4 1
## Porsche 914-2 2.140 16.70 0 1 5 2
## Lotus Europa 1.513 16.90 1 1 5 2
## Ford Pantera L 3.170 14.50 0 1 5 4
## Ferrari Dino 2.770 15.50 0 1 5 6
## Maserati Bora 3.570 14.60 0 1 5 8
## Volvo 142E 2.780 18.60 1 1 4 2
But a really cool thing in R is that it is very easy to index a column in a dataframe based on its name (since 2013, this is also possible when using the “Table” format in Matlab, but not as easy as it is in R).
# for example, I could first check again the names of the columns in mtcars
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
# or use colnames() which is basically the same, use rownames() for the row names
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
# then I can index with the '$' sign the column called 'hp', which describes the power of a motor in horse power
mtcars$hp # this gives you all the elements (rows) in the column hp
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230
## [18] 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109
# notice that if you extract a column from a dataframe and save it in a new object it no longer will be a dataframe!
mtcars_hp <- mtcars$hp
class(mtcars_hp) # we now got a vector of numeric elements
## [1] "numeric"
typeof(mtcars_hp) # for R numbers are of the type 'double'
## [1] "double"
# Oddly, if you extract a row from a dataframe it is still a dataframe (some things are just inconsistent I guess)
mtcars_row3 <- mtcars[3, ]
class(mtcars_row3)
## [1] "data.frame"
We could however also force mtcars_hp to be a dataframe, just like mtcars, from which it originally came. We do this with as.data.frame()
mtcars_hp <- as.data.frame(mtcars$hp)
class(mtcars_hp)
## [1] "data.frame"
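Alternatively, keep the data.frame class right away by adding drop = FALSE when indexing:

```r
# drop = FALSE prevents R from simplifying a single column to a vector
mtcars_hp <- mtcars[, "hp", drop = FALSE]
class(mtcars_hp)  # "data.frame"
```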
What are all the unique numbers of cylinders that cars in the mtcars dataset have?
unique(mtcars$cyl)
## [1] 6 4 8
Or do it with the table() function, which also tells you how many car models exist for each number of cylinders
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
In your own datasets the table() function might come in handy to quickly check how many trials you have per participant and per condition!
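table() also accepts 2 (or more) variables, giving a cross-tabulation, which is exactly the participant-by-condition count you would want for your own data. For mtcars, for example:

```r
# cross-tabulate the number of car models per combination of cylinders and gears
table(mtcars$cyl, mtcars$gear)
```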
When applied to a variable, the function order() returns the positions of the elements, from the smallest value to the largest; in other words, the permutation that sorts the vector (use rank() if instead you want each element's rank).
# for example, we can find the ascending rank of the elements of a
a
## [1] 0.0 NA 3.2 -15.0 2.0
order(a)
## [1] 4 1 5 3 2
The smallest element of a is -15, and it is in 4th position. The next smallest element is 0, which is the 1st element of a, and so forth.
Knowing this, we can use the output of order(a) to reshuffle a:
a[order(a)]
## [1] -15.0 0.0 2.0 3.2 NA
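If you only need the sorted values (not the indices), sort() is a shortcut; just note that it handles NA differently:

```r
a <- c(0, NA, 3.2, -15, 2)
sort(a)                  # -15.0 0.0 2.0 3.2 -- NA is dropped by default
sort(a, na.last = TRUE)  # -15.0 0.0 2.0 3.2 NA -- same result as a[order(a)]
```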
It works similarly with data frames:
# sort data frame mtcars by 1 variable
new_order <- order(mtcars$mpg) # get the indexes corresponding to the smallest to largest mpg
mtcars_sort <- mtcars[new_order, ] # now put the rows of mtcars into the new order
head(mtcars_sort)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
# sort matrix mtcars by 2 variables: miles per gallon and displacement
new_order <- order(mtcars$mpg, mtcars$disp) # use mpg and disp
mtcars_sort <- mtcars[new_order, ]
head(mtcars_sort)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
# sort by mpg ascending, and by cyl descending (by putting a minus sign)
new_order <- order(mtcars$mpg, -mtcars$cyl)
mtcars_sort <- mtcars[new_order, ]
head(mtcars_sort)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
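To reverse the whole sort order you can also use the decreasing argument of order() instead of the minus trick (which only works for numeric variables):

```r
# sort mtcars from highest to lowest mpg
new_order <- order(mtcars$mpg, decreasing = TRUE)
head(mtcars[new_order, ], 2) # the most economical cars now come first
```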
Create a vector filled with 10 times the same number:
rep(7, 10) # no need to wrap rep() in c()
## [1] 7 7 7 7 7 7 7 7 7 7
Create a vector filled with 5 times the same string:
rep("hello", 5)
## [1] "hello" "hello" "hello" "hello" "hello"
Create a vector filled with 2 times the first element, and 3 times the 2nd element:
rep(c("YES", "NO"), times = c(2, 3))
## [1] "YES" "YES" "NO" "NO" "NO"
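rep() also has an each argument, which repeats every element in place before moving on to the next one:

```r
rep(c("YES", "NO"), each = 2)            # "YES" "YES" "NO" "NO"
rep(c("YES", "NO"), each = 2, times = 2) # the whole pattern repeated twice
```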
(For a really good video tutorial (including hands-on exercises) on missing and special values see here )
Missing values are represented by NA (not available),
but sometimes also by a dot (“.”) or an empty space.
The function is.na() helps you to find missing data in your dataset
# let's create a dataframe with 4 rows, 2 columns, and 3 missing values
mydata <- data.frame(A = c(2, 5, NA, 7), B = c(NA, 1, NA, 9))
mydata
## A B
## 1 2 NA
## 2 5 1
## 3 NA NA
## 4 7 9
# then check for missing data
is.na(mydata)
## A B
## [1,] FALSE TRUE
## [2,] FALSE FALSE
## [3,] TRUE TRUE
## [4,] FALSE FALSE
# To simply know if there are ANY missing data use any():
any(is.na(mydata))
## [1] TRUE
# For R, TRUE == 1 and FALSE == 0. Therefore, we can count the number of missing data with
sum(is.na(mydata))
## [1] 3
Another way to see the number of NAs is summary()
summary(mydata)
## A B
## Min. :2.000 Min. :1
## 1st Qu.:3.500 1st Qu.:3
## Median :5.000 Median :5
## Mean :4.667 Mean :5
## 3rd Qu.:6.000 3rd Qu.:7
## Max. :7.000 Max. :9
## NA's :1 NA's :2
complete.cases() tells you which rows contain NAs (FALSE) and which ones do not (TRUE)
complete.cases(mydata)
## [1] FALSE TRUE FALSE TRUE
Use na.omit() to remove cases (i.e. rows) with NAs
# the result is a dataframe with only 2 rows and 2 columns
na.omit(mydata)
## A B
## 2 5 1
## 4 7 9
Many functions only work if you indicate to remove missing values, which you do with na.rm = TRUE:
mean(mydata$A) # returns NA
## [1] NA
mean(mydata$A, na.rm = TRUE) # this works
## [1] 4.666667
colMeans(mydata) # returns NA NA
## A B
## NA NA
colMeans(mydata, na.rm = TRUE) # this works
## A B
## 4.666667 5.000000
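Sometimes you may want to replace NAs with a value instead of removing them (whether that is appropriate depends on your data; this only shows the mechanics). Logical indexing with is.na() does it in one line:

```r
mydata <- data.frame(A = c(2, 5, NA, 7), B = c(NA, 1, NA, 9)) # same data as above
mydata[is.na(mydata)] <- 0 # replace every NA with 0
sum(is.na(mydata))         # 0 missing values left
```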
Many common statistical tests assume that the data is normally distributed, and that the variances are similar across groups (or conditions). Violation of these assumptions can lead to wrong results.
Luckily, there are several ways to “beat” (transform) the data into a more normal shape. Many types of data transformations can be used; for an overview, and a useful tool, see the package bestNormalize. Alternatively, use nonparametric statistics.
For example, RT data is typically right-skewed, but can become more normal if you apply a log() or log10() transformation.
# create a right-skewed distribution using rexgauss() from the package retimes
library(retimes)
rt_dist1 <- rexgauss(5000, 300, 100, 200, positive = TRUE)
hist(rt_dist1, xlab = "RT in milliseconds", axes = F, main = "Positive Skewed")
# test for normality -> it is not normal
shapiro.test(rt_dist1)
##
## Shapiro-Wilk normality test
##
## data: rt_dist1
## W = 0.91562, p-value < 2.2e-16
# histogram of log-transformed data -> looks much more normal!
hist(log(rt_dist1), xlab = "RT in milliseconds", axes = F, main = "after log() transformation")
# test for normality -> still not normal
shapiro.test(log(rt_dist1))
##
## Shapiro-Wilk normality test
##
## data: log(rt_dist1)
## W = 0.987, p-value < 2.2e-16
Thus, although the Shapiro-Wilk test remained significant, the data looked much more normally distributed after the logarithmic transformation.
Based on visual inspection of the histogram, and keeping in mind that for example ANOVA is quite robust against violating the normality assumption, one could go on and do statistics on this data (but not everybody might agree).
Also keep in mind that the shapiro.test() function does not work with vectors with more than 5000 elements :(
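If you do not have the retimes package installed, you can make the same point with simulated log-normal data (a sketch; rlnorm() generates data whose logarithm is exactly normal, so here the log transform works perfectly):

```r
set.seed(1)
x <- rlnorm(500, meanlog = 5.7, sdlog = 0.3) # right-skewed, RT-like values
shapiro.test(x)$p.value      # small -> clearly non-normal
shapiro.test(log(x))$p.value # much larger -> normality restored by log()
```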
We have already seen how to select only parts of a vector or data frame
(e.g. mtcars[1:3, 1:3] selects the first 3 rows and columns)
It is also easy to exclude only a few columns and keep all the rest:
# say you want to remove the 2nd column (cyl) and the 4th column (hp)
mtcars_pruned <- mtcars[c(-2, -4)] # put a - in front of the columns you want to remove
head(mtcars_pruned)
## mpg disp drat wt qsec vs am gear carb
## Mazda RX4 21.0 160 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 160 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 108 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 258 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 360 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 225 2.76 3.460 20.22 1 0 3 1
# or we could do so by indicating the names of the columns; let's first find them
cols_to_remove <- which(colnames(mtcars) == "disp" | colnames(mtcars) == "hp") # the | sign means 'or'
# then remove them
mtcars_pruned <- mtcars[-cols_to_remove]
head(mtcars_pruned)
## mpg cyl drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 2.76 3.460 20.22 1 0 3 1
Another useful function is subset()
# to select the cars with 4 cylinders and at least 70 horse power:
subset(mtcars, cyl == 4 & hp > 70)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
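subset() additionally takes a select argument, so you can filter rows and keep only some columns in one call:

```r
# cars with 4 cylinders and at least 70 horse power, keeping only mpg and hp
subset(mtcars, cyl == 4 & hp > 70, select = c(mpg, hp))
```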
Use cbind() to put data frames next to each other (add columns), and rbind() to stack them on top of each other (add rows)
mtcars_wide <- cbind(mtcars_pruned, mtcars)
colnames(mtcars_wide)
## [1] "mpg" "cyl" "drat" "wt" "qsec" "vs" "am" "gear" "carb" "mpg"
## [11] "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
Or combine them in the other direction
mtcars_long <- rbind(mtcars[1:2, ], mtcars[31:32, ]) # stack the first 2 and the last 2 rows
dim(mtcars_long)
## [1] 4 11
But this does not work if the 2 data frames differ in the other dimension (number of variables)
# e.g. you cannot put a data frame with 9 columns on top of one with 11 columns
mtcars_doesntwork <- rbind(mtcars_pruned, mtcars) # you get an error
## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match
Use cbind.fill() from the package rowr to horizontally combine data frames with different numbers of rows.
E.g. this does not work: cbind(mtcars, iris)
library(rowr)
# ATTENTION: if you do not specify 'fill=NA' the rows of the smaller data frame will be repeated!
cars_and_flowers_wide <- cbind.fill(mtcars, iris, fill = NA)
cars_and_flowers_wide[32:33, 1:12]
## mpg cyl disp hp drat wt qsec vs am gear carb Sepal.Length
## 32 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 5.4
## 33 NA NA NA NA NA NA NA NA NA NA NA 5.2
Use rbind.fill() from the package plyr to vertically combine data frames with different numbers of columns.
Again, this does not work: rbind(mtcars, iris)
library(plyr)
cars_and_flowers_long <- rbind.fill(mtcars, iris)
cars_and_flowers_long[32:33, 1:12]
## mpg cyl disp hp drat wt qsec vs am gear carb Sepal.Length
## 32 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 NA
## 33 NA NA NA NA NA NA NA NA NA NA NA 5.1
Now let’s check which cases are complete (have no missing values)
dim(cars_and_flowers_wide)
## [1] 150 16
complete.cases(cars_and_flowers_wide) # (in this case only the first 32 cases are complete)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Select subset of non-missing cases in a data frame with na.omit()
dim(cars_and_flowers_wide)
## [1] 150 16
cars_and_flowers_wide_noNA = na.omit(cars_and_flowers_wide)
dim(cars_and_flowers_wide_noNA)
## [1] 32 16
Useful for example to combine RTs with questionnaire data (often in separate files)
# load dataset 'iris'
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# split in 2 different datasets (but both contain the 5th column 'Species')
iris_1 <- iris[, c(1, 2, 5)]
iris_2 <- iris[, c(3, 4, 5)]
# then merge them again (order matters)
iris_merged <- merge(iris_2, iris_1, by = "Species")
head(iris_merged)
## Species Petal.Length Petal.Width Sepal.Length Sepal.Width
## 1 setosa 1.4 0.2 5.1 3.5
## 2 setosa 1.4 0.2 4.9 3.0
## 3 setosa 1.4 0.2 4.7 3.2
## 4 setosa 1.4 0.2 4.6 3.1
## 5 setosa 1.4 0.2 5.0 3.6
## 6 setosa 1.4 0.2 5.4 3.9
iris_3 <- iris[, c(1, 4, 5)]
iris_merged <- merge(iris_3, iris_2, by = c("Species", "Petal.Width"))
head(iris_merged)
## Species Petal.Width Sepal.Length Petal.Length
## 1 setosa 0.1 4.9 1.4
## 2 setosa 0.1 4.9 1.1
## 3 setosa 0.1 4.9 1.5
## 4 setosa 0.1 4.9 1.4
## 5 setosa 0.1 4.9 1.5
## 6 setosa 0.1 4.3 1.4
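By default merge() keeps only the rows that match in both data frames (an inner join). A small sketch with invented data shows how all = TRUE keeps unmatched rows too, filling the gaps with NAs:

```r
d1 <- data.frame(id = c(1, 2, 3), rt = c(540, 610, 495))
d2 <- data.frame(id = c(2, 3, 4), score = c(7, 9, 4))
merge(d1, d2, by = "id")             # only ids 2 and 3 appear in both
merge(d1, d2, by = "id", all = TRUE) # ids 1 to 4, with NAs where data is missing
```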
In the wide format, a subject’s repeated responses will be in a single row, and each response is in a separate column.
In the long format, each row is 1 time point x subject. So each subject will have data in multiple rows. Any variables that do not change across time (e.g. subjects’ gender) will have the same value in all the rows.
To change from LONG to WIDE use dcast() from the “reshape2” package
data("ChickWeight") # import the chicken weight data
# somehow the order of the levels of the factor $Chick is messed up, so let's change it
ChickWeight$Chick <- as.factor(as.numeric(ChickWeight$Chick))
some(ChickWeight) # some() from the package 'car' shows a random subset of rows
## weight Time Chick Diet
## 26 39 2 14 1
## 58 199 18 18 1
## 120 43 0 13 1
## 269 40 0 28 2
## 316 233 21 26 2
## 405 98 8 33 3
## 440 78 6 35 3
## 447 250 20 35 3
## 489 131 8 43 4
## 522 82 6 47 4
length(unique(ChickWeight$Time))
## [1] 12
The weight of most chickens was measured at 12 different time points.
But some chickens were measured less often.
You can verify with the table() function
# see the first 6 rows (chicken). A 0 indicates that no measurement occurred
head(table(ChickWeight$Chick, ChickWeight$Time))
##
## 0 2 4 6 8 10 12 14 16 18 20 21
## 1 1 1 0 0 0 0 0 0 0 0 0 0
## 2 1 1 1 1 1 1 1 0 0 0 0 0
## 3 1 1 1 1 1 1 1 1 0 0 0 0
## 4 1 1 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1 1 1 1 1
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.5.3
# now transform to wide format using dcast():
Chicken_wide <- dcast(ChickWeight,
# after the name of the dataset put the variables to keep as they are
Chick + Diet
# then ~ name of variable to split into separate columns
~ Time,
# then name of the dependent variable
value.var = "weight")
head(Chicken_wide)
## Chick Diet 0 2 4 6 8 10 12 14 16 18 20 21
## 1 1 1 39 35 NA NA NA NA NA NA NA NA NA NA
## 2 2 1 41 45 49 51 57 51 54 NA NA NA NA NA
## 3 3 1 41 49 56 64 68 68 67 68 NA NA NA NA
## 4 4 1 41 48 53 60 65 67 71 70 71 81 91 96
## 5 5 1 42 51 59 68 85 96 90 92 93 100 100 98
## 6 6 1 41 47 54 58 65 73 77 89 98 107 115 117
# improve the header
colnames(Chicken_wide) # let's check the column names first
## [1] "Chick" "Diet" "0" "2" "4" "6" "8" "10"
## [9] "12" "14" "16" "18" "20" "21"
ts <- rep("t_", 12) # create a vector with 12 times "t_"
newnames <- paste(ts, colnames(Chicken_wide)[3:14], sep="") # concatenate "t_" with colnames (paste0() would do both steps in one)
colnames(Chicken_wide)[3:14] <- newnames
head(Chicken_wide)
## Chick Diet t_0 t_2 t_4 t_6 t_8 t_10 t_12 t_14 t_16 t_18 t_20 t_21
## 1 1 1 39 35 NA NA NA NA NA NA NA NA NA NA
## 2 2 1 41 45 49 51 57 51 54 NA NA NA NA NA
## 3 3 1 41 49 56 64 68 68 67 68 NA NA NA NA
## 4 4 1 41 48 53 60 65 67 71 70 71 81 91 96
## 5 5 1 42 51 59 68 85 96 90 92 93 100 100 98
## 6 6 1 41 47 54 58 65 73 77 89 98 107 115 117
Use the melt() function from the “reshape2” package
head(Chicken_wide)
## Chick Diet t_0 t_2 t_4 t_6 t_8 t_10 t_12 t_14 t_16 t_18 t_20 t_21
## 1 1 1 39 35 NA NA NA NA NA NA NA NA NA NA
## 2 2 1 41 45 49 51 57 51 54 NA NA NA NA NA
## 3 3 1 41 49 56 64 68 68 67 68 NA NA NA NA
## 4 4 1 41 48 53 60 65 67 71 70 71 81 91 96
## 5 5 1 42 51 59 68 85 96 90 92 93 100 100 98
## 6 6 1 41 47 54 58 65 73 77 89 98 107 115 117
data_long <- melt(Chicken_wide,
# id.variables = all the variables to keep w/o splitting
id.vars=c("Chick", "Diet"),
# measure.vars = the columns to put from wide into long format
measure.vars=c(colnames(Chicken_wide)[3:14]),
# variable.name = name of the new column that will store the former column names
variable.name="Time",
# value.name = name of the new column that will store the measurements
value.name="weight")
head(data_long)
## Chick Diet Time weight
## 1 1 1 t_0 39
## 2 2 1 t_0 41
## 3 3 1 t_0 41
## 4 4 1 t_0 41
## 5 5 1 t_0 42
## 6 6 1 t_0 41
If you leave out measure.vars, melt() will automatically use all the variables not listed under id.vars as measure variables.
The reverse is true if you leave out id.vars.
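A minimal self-contained sketch of this default behaviour (toy data with invented column names; requires the reshape2 package):

```r
library(reshape2)
wide <- data.frame(subj = c("s1", "s2"), t_0 = c(39, 41), t_2 = c(35, 45))
melt(wide, id.vars = "subj") # t_0 and t_2 become measure.vars automatically
```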
##
## -----------------------------
## Operator Description
## ---------- ------------------
## + addition
##
## - subtraction
##
## * multiplication
##
## / division
##
## ^ or ** exponentiation
##
## %% modulus
##
## %/% integer division
## -----------------------------
Examples with arithmetic operators
# divide column hp by column cyl
mtcars$hp/mtcars$cyl
## [1] 18.33333 18.33333 23.25000 18.33333 21.87500 17.50000 30.62500
## [8] 15.50000 23.75000 20.50000 20.50000 22.50000 22.50000 22.50000
## [15] 25.62500 26.87500 28.75000 16.50000 13.00000 16.25000 24.25000
## [22] 18.75000 18.75000 30.62500 21.87500 16.50000 22.75000 28.25000
## [29] 33.00000 29.16667 41.87500 27.25000
# use the modulus to check which numbers in column gear are odd or even
mtcars$gear%%2 # 0 is even, 1 is odd
## [1] 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 0
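The integer division operator %/% complements the modulus: together they split a number into quotient and remainder:

```r
17 %/% 5 # quotient: 3 (how many times 5 fits into 17)
17 %% 5  # remainder: 2
# sanity check: quotient * divisor + remainder recovers the number
17 %/% 5 * 5 + 17 %% 5 # 17
```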
##
## --------------------------------------
## Operator Description
## ---------- ---------------------------
## < less than
##
## <= less than or equal to
##
## > greater than
##
## >= greater than or equal to
##
## == exactly equal to
##
## != not equal to
##
## ! not (something)
##
## | or
##
## & and
##
## isTRUE() test if something is true
## --------------------------------------
Examples with logical operators
# find which cars have 2 carburetors
mtcars$carb == 2
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## [23] TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
# find which cars DO NOT HAVE 2 carburetors
mtcars$carb != 2
## [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## [23] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
# or
!mtcars$carb == 2
## [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## [23] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
# find cars that have at least 100 hp and less than 8 cylinders
(mtcars$hp >= 100 & mtcars$cyl < 8)
## [1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE
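Such logical vectors become most useful as row indexes, which is essentially what subset() does under the hood:

```r
# keep only the cars meeting both criteria
strong_small <- mtcars[mtcars$hp >= 100 & mtcars$cyl < 8, ]
nrow(strong_small) # 9 cars meet both criteria
```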
Use function aggregate() in package stats
# load data 'ChickWeight', which contains Weight versus age of chicks on different diets
data(ChickWeight)
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
Weight_by_Diet <- aggregate(ChickWeight$weight, by = list(Diet = ChickWeight$Diet), FUN = mean, na.rm = TRUE)
colnames(Weight_by_Diet)[2] <- "mean weight" # let's add a nicer header for the 2nd column
Weight_by_Diet
## Diet mean weight
## 1 1 102.6455
## 2 2 122.6167
## 3 3 142.9500
## 4 4 135.2627
You could also aggregate by time instead
Weight_by_Time <- aggregate(ChickWeight$weight, by = list(Time = ChickWeight$Time), FUN = mean, na.rm = TRUE)
colnames(Weight_by_Time)[2] <- "mean weight"
Weight_by_Time
## Time mean weight
## 1 0 41.06000
## 2 2 49.22000
## 3 4 59.95918
## 4 6 74.30612
## 5 8 91.24490
## 6 10 107.83673
## 7 12 129.24490
## 8 14 143.81250
## 9 16 168.08511
## 10 18 190.19149
## 11 20 209.71739
## 12 21 218.68889
Or aggregate using a different function (e.g. SD)
Weight_by_Time <- aggregate(ChickWeight$weight, by = list(Time = ChickWeight$Time), FUN = sd, na.rm = TRUE)
colnames(Weight_by_Time)[2] <- "SD time"
Weight_by_Time
## Time SD time
## 1 0 1.132272
## 2 2 3.688316
## 3 4 4.495179
## 4 6 9.012038
## 5 8 16.239780
## 6 10 23.987277
## 7 12 34.119600
## 8 14 38.300412
## 9 16 46.904079
## 10 18 57.394757
## 11 20 66.511708
## 12 21 71.510273
Or aggregate by 2 variables (Time & Diet)
Weight_by_Time_and_Diet <- aggregate(ChickWeight$weight, by = list(Time = ChickWeight$Time, Diet = ChickWeight$Diet), FUN = mean,
na.rm = TRUE)
colnames(Weight_by_Time_and_Diet)[3] <- "Mean Weight"
some(Weight_by_Time_and_Diet)
## Time Diet Mean Weight
## 4 6 1 66.78947
## 8 14 1 123.38889
## 11 20 1 170.41176
## 15 4 2 59.80000
## 22 18 2 187.70000
## 30 10 3 117.10000
## 35 20 3 258.90000
## 36 21 3 270.30000
## 42 10 4 126.00000
## 45 16 4 182.00000
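aggregate() also accepts R's formula interface, which is often more readable than the by = list() syntax (the dependent variable goes on the left of the ~, the grouping variables on the right):

```r
data(ChickWeight)
head(aggregate(weight ~ Diet + Time, data = ChickWeight, FUN = mean))
```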
Or use ddply() in package plyr, which also allows to simultaneously compute N, M, and SD
library(plyr)
Weight_by_Time_and_Diet <- ddply(ChickWeight, .(Diet, Time), summarise, N = length(weight), M = round(mean(weight, na.rm = TRUE),
2), SD = round(sd(weight, na.rm = TRUE), 2))
head(Weight_by_Time_and_Diet)
## Diet Time N M SD
## 1 1 0 20 41.40 0.99
## 2 1 2 20 47.25 4.28
## 3 1 4 19 56.47 4.13
## 4 1 6 19 66.79 7.76
## 5 1 8 19 79.68 13.78
## 6 1 10 19 93.05 22.54
tail(Weight_by_Time_and_Diet)
## Diet Time N M SD
## 43 4 12 10 151.40 15.28
## 44 4 14 10 161.80 15.73
## 45 4 16 10 182.00 25.30
## 46 4 18 10 202.90 33.56
## 47 4 20 9 233.89 37.57
## 48 4 21 9 238.56 43.35
Plotting (or data visualisation) is essential to
1) better understand your data
2) summarize and convey the results of your analyses
There are several sets of plotting functions in R.
We will start with the “Base graphics” system, which offers quick but limited plotting possibilities.
The “ggplot2” system, which we’ll discuss further on, is a bit more complicated to use but provides much more options and possibilities.
When you create a plot in RStudio, it is automatically shown in the “Plots” pane on the lower right.
From there, you can export it to an image (e.g. jpg or png) or a .pdf using the GUI (later, we will see other ways to do this)
See here for a hands-on tutorial on plotting and data visualization in R
Scatterplots can easily be made using the plot() function. This is a generic function, meaning that it can adapt to different data types. With it, one can create scatterplots, dot plots, line plots, etc.
Let’s do a simple scatterplot with the 2 variables “hp” (horsepower) and “mpg” (miles per gallon) from the dataset mtcars
plot(mtcars$hp, mtcars$mpg)
As expected, there seems to be a negative correlation between horsepower and the number of miles you can do with one gallon of gas, i.e. the stronger the motor, the higher the gas consumption and thus the lower the miles per gallon.
Now let’s modify the axis labels and add a title
plot(mtcars$hp, mtcars$mpg, xlab = "Horse Power", ylab = "Miles x Gallon", main = "Horse Power by Miles x Gallon in mtcars")
Notice that you could create the same plot using R’s formula interface:
plot(mpg ~ hp, data = mtcars); note that in the formula the y-axis variable comes first.
To see more options of the plot() function type ?graphics::plot and Enter in the console
The function par() sets graphics parameters (fonts, colors, axes, titles), and keeps them until the next par() call.
Useful for example to put multiple plots in the same figure.
# create a plot array with 1 row and 2 columns
par(mfrow = c(1, 2))
plot(mtcars$hp, mtcars$mpg, xlab = "Horse Power", ylab = "Miles x Gallon", main = "plot 1")
plot(mtcars$cyl, mtcars$carb, xlab = "cylinders", ylab = "carburetors", main = "plot 2")
But there are many more parameters. Check the current ones with par()
See this link for a description.
CAREFUL: you cannot use par() to combine plots made with ggplot2 (see below).
In order to explore the data, one could also use scatterplots comparing more than 2 variables.
Let’s create a plot array with the first 4 variables in the mtcars dataset:
plot(mtcars[, 1:4])
Could be done with the plot() function, setting the “type” argument to “l”
# let's order mtcars by number of cylinders
mtcars_sorted <- mtcars[order(mtcars$cyl), ]
plot(mtcars_sorted$hp, type = "l", xlab = "cars ordered by cyl", ylab = "horsepower", main = "this graph doesn't make much sense")
# possible types are 'p' for points, 'l' for lines, 'b' for both, etc. -> see ?plot()
Or do first a scatterplot with plot() and then add the lines with lines()
plot(mtcars_sorted$hp)
lines(mtcars_sorted$hp, type = "l")
Sunflowerplots are useful when the same data combination gets repeated many times.
For each repetition, a petal is added:
* a black dot without petals = 1 single data point
* 2 red petals = 2 identical data points
* 3 red petals = 3 identical data points, etc.
sunflowerplot(mtcars$cyl, mtcars$carb, xlab = "number of cylinders", ylab = "number of carburetors", main = "Sunflowerplot of cyl by carb")
Useful to see how the distribution of a numerical variable changes over the levels of a categorical variable, or of another numerical variable that has only a few values.
The box of the boxplot includes: 1st quartile, median, and 3rd quartile.
By default when using boxplot() the whiskers extend to the most extreme data point which is no more than 1.5 times the interquartile range from the box.
Everything beyond is an outlier (marked with points).
The whiskers can however be changed, e.g. range = 0 causes the whiskers to extend to the data extremes.
Let’s see gas consumption by number of cylinders
boxplot(mtcars$mpg ~ mtcars$cyl, xlab = "cylinders", ylab = "miles per gallon")
# now extend the whiskers to the data extremes (no more outliers)
boxplot(mtcars$mpg ~ mtcars$cyl, range = 0, xlab = "cylinders", ylab = "miles per gallon")
Like a scatterplot between 2 categorical variables, or between numerical variables with only a few values.
A thin line is shown where a combination of categories is absent.
mosaicplot(cyl ~ carb, data = mtcars)
These look cool but are not recommended.
Even the help (?pie) for this function thinks so:
“Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.”
# first let's get a count of the number of cars per number of carburetors
tbl <- table(mtcars$carb)
# then let's make a pie chart out of it
pie(tbl)
title("number of cars with 1, 2, 3, 4, 6, or 8 carburetors")
Now we show the same data as a bar plot instead. Remember that ideally you’d also add errorbars - this is just a simple example!
tbl <- table(mtcars$carb)
# then let's make a bar chart out of it
barplot(tbl, xlab = "number of carburetors", ylab = "number of cars")
title("number of cars by number of carburetors")
This is actually a histogram!
Shows the frequencies of occurrences for each level of a given variable.
Useful also to quickly see the distribution of the data (is it normal?).
Let’s again plot the number of cars that have a specific amount of carburetors, but this time with the function hist()
# ATTENTION: if you do not want the counts of cars with 1 or 2 carburetors to be lumped together, you need to
# change the breaks!!
hist(mtcars$carb, breaks = seq(0.5, 8.5, 1), freq = TRUE)
A common assumption necessary for many statistical analyses is that the data is approximately normally distributed.
There are several ways of testing this. One of them is the qqPlot (quantile-quantile plot).
For the qqplot we sort our data, plot it against a specially-designed x-axis based on our reference distribution (e.g., the Gaussian “bell curve”), and look to see whether the points lie approximately on a straight line.
If the data is normally distributed, the points will align on the diagonal of the qqplot
The variable “weight” in the “ChickWeight” dataset does not follow a normal distribution
library(car) # load the 'car' package to use the function qqPlot()
par(mfrow = c(1, 2))
hist(ChickWeight$weight)
qqPlot(ChickWeight$weight)
## [1] 400 399
But if we select only measures at Time == 16,
then the assumption of normality seems more reasonable
index16 = which(ChickWeight$Time == 16)
weights = ChickWeight[index16, "weight"]
par(mfrow = c(1, 2))
hist(weights, breaks = seq(1, 320, 20))
qqPlot(weights)
## [1] 32 18
And here an example of data that is definitely not normally distributed
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
par(mfrow = c(1, 2))
hist(Boston$tax)
qqPlot(Boston$tax)
## [1] 489 490
ggplot2 was written by Hadley Wickham, a kiwi statistician, professor, and chief scientist at RStudio (check out his website)
It is “a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.” (http://ggplot2.org)
Plotting functions in base R can still be useful to quickly explore the data.
But if you want anything beyond those simple graphs, it’s best to switch to ggplot2.
“ggplot2 provides a unified interface and set of options, instead of the grab bag of modifiers and special cases required in base graphics. Once you learn how ggplot2 works, you can use that knowledge for everything from scatter plots and histograms to violin plots and maps.” (R graphics cookbook by Winston Chang, p.7)
“In R’s base graphics functions, each mapping of data properties to visual properties is its own special case, and changing the mappings may require restructuring your data, issuing completely different graphics commands, or both.” (R graphics cookbook by Winston Chang, p.374)
CAREFUL: you cannot use par() to combine plots made with ggplot2. Use the cowplot or gridExtra packages instead.
DISCLAIMER: I have copied a lot of examples from the R graphics cookbook by Winston Chang!
# Create a 2x3 data frame
mydata <- matrix(c(10, 7, 12, 9, 11, 6), nrow = 2, ncol = 3, byrow = TRUE, dimnames = list(c("B1", "B2"), c("A1", "A2", "A3")))
mydata
## A1 A2 A3
## B1 10 7 12
## B2 9 11 6
Make a barplot with the data using function barplot() with As along x axis, and bars grouped by Bs
graphics::barplot(mydata, beside = TRUE) # beside=TRUE draws separate bars for each data point, otherwise B1 and B2 overlap with diff colors
To turn the barplot around (Bs along the x axis and bars grouped by As), you first need to restructure the data (e.g. using the function t() for transpose )
t(mydata)
## B1 B2
## A1 10 9
## A2 7 11
## A3 12 6
Then you can do another barplot:
barplot(t(mydata), beside = TRUE)
So far so good, transposing the data wasn’t that bad, one could say.
But things get more complicated if you want to make a lineplot.
To do this with base graphics you need to draw the lines for B1 and B2 in separate commands.
plot(mydata[1, ], type = "l")
lines(mydata[2, ], type = "l", col = "blue")
And the result is not very convincing:
For ggplot2 the data must be in “long” format, and you never need to change it.
So first we put mydata in long format using melt() from the reshape2 package:
library(reshape2)
mydata_long <- melt(mydata, value.name = "value")
colnames(mydata_long)[1:2] <- c("theBs", "theAs")
mydata_long
## theBs theAs value
## 1 B1 A1 10
## 2 B2 A1 9
## 3 B1 A2 7
## 4 B2 A2 11
## 5 B1 A3 12
## 6 B2 A3 6
Now comes the barplot using ggplot(), and it will look better right away (e.g. notice the axis labels and the legend):
ggplot(mydata_long, aes(x = theAs, y = value, fill = theBs)) + geom_bar(stat = "identity", position = "dodge")
Turning it around and putting the Bs on the x-axis is simple:
ggplot(mydata_long, aes(x = theBs, y = value, fill = theAs)) + geom_bar(stat = "identity", position = "dodge")
If you wanted a line plot instead:
ggplot(mydata_long, aes(x = theBs, y = value, colour = theAs, group = theAs)) + geom_line()
Or this way?
ggplot(mydata_long, aes(x = theAs, y = value, colour = theBs, group = theBs)) + geom_line()
Many of the elements of ggplots are combined by joining them with the plus sign (+)
If you write the ggplot command over several lines, be careful to always put + at the end of the line!
Let’s create some new data for a scatterplot
mydata <- data.frame(xval = 1:4, yval = c(3, 5, 6, 9), group = c("A", "B", "A", "B"))
mydata
## xval yval group
## 1 1 3 A
## 2 2 5 B
## 3 3 6 A
## 4 4 9 B
Now we make a simple scatterplot with ggplot(),
but before plotting it we save it as “myplot”
myplot <- ggplot(mydata, aes(x = xval, y = yval)) + geom_point()
myplot # then plot it by calling its name
You can change or add more elements to the saved plot using +
For example let’s plot groups A and B in different colors:
myplot + geom_point(aes(colour = group))
Or you can change the range of the x-axis:
myplot + scale_x_continuous(limits = c(-5, 10))
Change the colors:
myplot + geom_point(aes(colour = group)) + scale_colour_manual(values = c("orange", "forestgreen"))
plot() vs. ggplot()
plot(mtcars$wt, mtcars$mpg)
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + theme(panel.background = element_blank()) + # to remove the gray background
theme(axis.line = element_line(colour = "black")) # otherwise the axes are no longer visible
# let's create some data
mydata <- data.frame(group = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"), Time = as.factor(c(1, 2, 1, 2,
1, 2, 1, 2, 1, 2, 1, 2)), yval = c(11, 3, 5, 6, 9, 1, 2, 3, 4, 5, 1, 0))
mydata
## group Time yval
## 1 A 1 11
## 2 A 2 3
## 3 A 1 5
## 4 A 2 6
## 5 A 1 9
## 6 A 2 1
## 7 B 1 2
## 8 B 2 3
## 9 B 1 4
## 10 B 2 5
## 11 B 1 1
## 12 B 2 0
There are (at least) 2 ways to add errorbars representing standard error of the mean (SE).
Here we will discuss both and put the resulting plots next to each other (using plot_grid() from the cowplot package, see below).
The 1st way is to compute the SEs yourself (SD/sqrt(N)), and then plot them with geom_errorbar
library(plyr) # for ddply(), if not already loaded
mydata_summary <- ddply(mydata, c("group", "Time"), summarise, M = mean(yval, na.rm = TRUE), SD = sd(yval, na.rm = TRUE),
N = sum(!is.na(yval)), SE = SD/sqrt(N))
mydata_summary # look at the computed parameters
## group Time M SD N SE
## 1 A 1 8.333333 3.055050 3 1.7638342
## 2 A 2 3.333333 2.516611 3 1.4529663
## 3 B 1 2.333333 1.527525 3 0.8819171
## 4 B 2 2.666667 2.516611 3 1.4529663
library(ggplot2) # load ggplot2, if necessary
errorbars1 <- ggplot(mydata_summary, aes(x = Time, y = M, fill = group)) + geom_bar(stat = "identity", position = position_dodge(0.5),
width = 0.5) + geom_errorbar(aes(ymin = M - SE, ymax = M + SE), position = position_dodge(0.5), width = 0.2) + ggtitle("SEs by hand")
# position_dodge defines the spacing between bars (by default it is the same as the bar width, i.e. 0.9)
# position='dodge' is the short form of position=position_dodge()
The 2nd (easier) way is to have ggplot do it for you using fun.data = mean_se:
errorbars2 <- ggplot(mydata, aes(x = Time, y = yval, fill = group)) + stat_summary(fun.y = "mean", geom = "bar", position = position_dodge(0.5),
width = 0.5) + # the height of the bars represents the mean per Time and group (note: in ggplot2 >= 3.3, use fun = instead of fun.y =)
stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(0.5), width = 0.2) + # the errorbars show the SE
ggtitle("Automatic SEs")
We use the package cowplot (see here) to arrange the 2 plots next to each other, and label them with A and B.
Using this package also makes the plots a bit nicer (white background, etc.)
####### if necessary install and/or load cowplot: install.packages('cowplot')
library(cowplot)
plot_grid(errorbars1, errorbars2, labels = c("A", "B"))
Reassuringly, the 2 graphs look exactly the same.
# ATTENTION: for the x-axis you need a factor, or a vector of characters
ggplot(mtcars, aes(as.factor(cyl), mpg)) +
geom_boxplot() +
geom_jitter(width = 0.2) + # we can move each data point a bit to the right or left, to make it visible
labs(x = "N of cylinders")
# plot
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(vs))) + geom_boxplot() + geom_point(alpha = 0.5, position = position_jitterdodge(jitter.width = 0.2),
aes(group = as.factor(vs)))
The violin plot is the better boxplot, as it also shows the distribution of the data (a kernel density plot).
# ATTENTION: for the x-axis you need a factor, or a vector of characters
ggplot(mtcars, aes(as.factor(cyl), mpg)) +
geom_violin() +
geom_jitter(width = 0.2) + # we can move each data point a bit to the right or left, to make it visible
labs(x = "N of cylinders")
We can overlay box plots and violin plots made with ggplot2. For that, we need to specify their positions using the function position_dodge.
ggplot(mtcars, aes(as.factor(cyl), mpg)) +
geom_violin(position = position_dodge(width = .04)) +
geom_boxplot(width = .05, position = position_dodge(width = .04)) +
geom_jitter(width = 0.2) + # we can move each data point a bit to the right or left, to make it visible
labs(x = "N of cylinders")
The function pirateplot() from the yarrr package makes it easy to combine box plots, violin plots, and the representation of every single data point. There are many built-in color palettes you can choose from.
library("yarrr")
pirateplot(formula = mpg ~ cyl, data = mtcars, pal = "southpark", main = "mpg by cyl in mtcars", ylim = c(5, 45))
You can have up to 3 different categorical IVs for pirate plots
pirateplot(formula = mpg ~ cyl + gear, data = mtcars, pal = "southpark", main = "mpg by cyl and by gears", ylim = c(5, 45))
But, as far as I know, you cannot assign a pirateplot to an object, as you can with ggplot2. This makes it difficult to arrange several pirate plots in the same figure (see below).
If you plot with the base graphics package, you can put several plots in 1 figure using the par() function. But this does not work with graphs made with ggplot2, for which one can use the cowplot or gridExtra packages instead.
library(cowplot)
plot1 <- ggplot(mtcars, aes(cyl, mpg)) +
geom_bar(stat="summary", fun.y = "mean") +
labs(x = "N of cylinders") +
labs(y = "Average mpg")
plot2 <- ggplot(mtcars, aes(as.factor(cyl), mpg)) +
geom_violin(position = position_dodge(width = .04)) +
geom_boxplot(width = .05, position = position_dodge(width = .04)) +
geom_jitter(width = 0.2) + # we can move each data point a bit to the right or left, to make it visible
labs(x = "N of cylinders")
# let's put plot1 and plot2 together, but in positions 1 and 4 of a 4-plot scenario
plot_grid(plot1, NULL, NULL, plot2, labels = c("A", "B", "C", "D"), ncol = 2)
You can save your graphs by hand using the Export button on the right-hand Plots tab.
Or you define in your script how and where to save a graph (see link).
Here, I save to .pdf, by first defining the name (with the path!),
calling pdf() before the plot,
and dev.off() after the plot.
Beware that if you save directly to a file, you will not see the plot being drawn in R!
SaveDir <- "YOUR_PATH"
library(cowplot)
name <- paste(SaveDir, "/my_amazing_plot.pdf", sep = "")
pdf(name, width = 15, height = 8) # alternatively use jpeg() or png()
plot_grid(errorbars1, errorbars2, labels = c("A", "B"))
dev.off()
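For single ggplot2 figures, a simpler alternative is ggsave(), which infers the file format from the file extension (a sketch; the file names are placeholders):

```r
library(ggplot2)
# save the last plot that was displayed
ggsave("my_amazing_plot.pdf", width = 15, height = 8)
# or pass a saved plot object explicitly; works for .pdf, .png, .jpeg, etc.
ggsave("my_amazing_plot.png", errorbars1, width = 15, height = 8)
```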
When you need to do the same operation (e.g. importing a subject’s data) many times, you might want to use a for loop.
For loops belong to the flow-control commands, and execute code according to a counter.
For example, in this for loop we start with the number 1 and add 1 to it 5 times, displaying the number at each iteration:
number <- 1
for (counter in 1:5) {
number <- number + 1
print(number)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
For loops can also be nested. For example, here we first add 1 to a number, and then print the word “Pizza” from 1 up to “number” times next to each other:
number <- 1
for (counter in 1:5) {
number <- number + 1
print(number)
for (counter2 in 1:number) {
print(rep("Pizza", counter2))
}
}
## [1] 2
## [1] "Pizza"
## [1] "Pizza" "Pizza"
## [1] 3
## [1] "Pizza"
## [1] "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza"
## [1] 4
## [1] "Pizza"
## [1] "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza"
## [1] 5
## [1] "Pizza"
## [1] "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza" "Pizza"
## [1] 6
## [1] "Pizza"
## [1] "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza" "Pizza"
## [1] "Pizza" "Pizza" "Pizza" "Pizza" "Pizza" "Pizza"
With if statements, we can make decisions and act accordingly. For example, we can ask whether a condition is met (e.g. a number is odd or even), and only if this is TRUE, another statement gets executed.
Here we print “Tiramisu” only if the number is odd; otherwise nothing happens:
number <- 5
if ((number%%2) == 1) {
print("Tiramisu")
}
## [1] "Tiramisu"
Similar to if(), but this time something also happens when the condition is NOT TRUE,
e.g. print “Tiramisu” for the number 2, and “Mousse au chocolat” for all other numbers:
number <- 5
if (number == 2) {
print("Tiramisu")
} else {
print("Mousse au chocolat")
}
## [1] "Mousse au chocolat"
Similar to if…else, but more specific.
e.g. print “One” for the number 1, “Two” for the number 2, and “and so forth” for all other numbers
number <- 5
if (number == 1) {
print("One")
} else if (number == 2) {
print("Two")
} else {
print("and so forth")
}
## [1] "and so forth"
Sort of a shorthand for the if…else statement.
It tests, for each element of a vector (e.g. a column of a data frame), whether a condition is true, and acts according to the instructions.
Useful, for example, to change the names of the levels of a factor (e.g. imagine you have an IV “Time” with levels 1, 2, and 3; you change them to t_1, t_2, t_3).
The syntax is ifelse(test_expression, x, y)
# check if these numbers are bigger than five. if yes write you are so big, otherwise write you are very tiny
numbers <- c(3, 9, 4, 8, 1, 20)
ifelse(numbers > 5, "you are so biiig", "you are very tiny")
## [1] "you are very tiny" "you are so biiig" "you are very tiny"
## [4] "you are so biiig" "you are very tiny" "you are so biiig"
# in a data frame, replace small numbers (<=5) with 'too small', and big numbers (>5) with 'too large'
numbers <- as.data.frame(c(3, 9, 4, 8, 1, 20))
colnames(numbers) <- "number"
numbers$number <- ifelse(numbers$number > 5, "too large", "too small")
ifelse() statements can be nested
# in a data frame, replace 'to do R' with 'Yuppee', 'not to do R' with 'Baaah', and 'maybe do R' with 'there is hope'
data <- as.data.frame(c("to do R", "not to do R", "maybe do R", "to do R", "maybe do R", "to do R"))
colnames(data) <- "motivation"
data$motivation <- ifelse(data$motivation == "to do R", "Yuppee", ifelse(data$motivation == "not to do R", "Baaah", "there is hope"))
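The factor-relabeling use case mentioned above (Time levels 1, 2, 3 renamed to t_1, t_2, t_3) can be sketched the same way (the data frame here is made up for illustration):

```r
# an IV 'Time' with levels 1, 2, and 3
df <- data.frame(Time = c(1, 2, 3, 1, 2, 3))
# nested ifelse: relabel each level in turn
df$Time <- ifelse(df$Time == 1, "t_1", ifelse(df$Time == 2, "t_2", "t_3"))
df$Time <- as.factor(df$Time) # turn the character vector back into a factor
levels(df$Time)
## [1] "t_1" "t_2" "t_3"
```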
Sometimes you want to create a new variable name at each iteration of a for loop.
For example, to save a plot, which differs at each iteration, in its own variable.
This later allows you to manipulate the plots, arrange them in the same figure, etc.
Use the assign() function.
library(ggplot2)
for (iteration in c(4, 6, 8)) {
# select cars with that many cylinders
temp <- as.data.frame(mtcars[mtcars$cyl == iteration, "mpg"])
colnames(temp) <- "mpg"
# create plot
myplot <- ggplot(temp, aes(y = mpg)) + geom_boxplot() + ggtitle(paste(iteration, "cylinders"))
# create a variable name and save the plot in it
var_name <- paste("plot_", iteration, sep = "")
assign(var_name, myplot)
}
cowplot::plot_grid(plot_4, plot_6, plot_8, ncol = 3)
Create a command, save it as a function, and execute it whenever you want.
Here we create the function “desert”, which takes a number and prints “Tiramisu” if the input is 2, and “Lasagne” for all other numbers.
desert <- function(number) {
if (number == 2) {
print("Tiramisu")
} else {
print("Lasagne")
}
}
Then use the function by feeding it different numbers and reading the output:
desert(1)
## [1] "Lasagne"
desert(2)
## [1] "Tiramisu"
You can also write the function in an .R script, which you save under the same name (desert.R).
Then, in order to use desert(), source it in R, i.e. tell R the directory you saved it in:
# e.g. if I saved the script as 'desert.R' in the folder 'Z:/Myfolder'
source("Z:/Myfolder/desert.R")
desert(4/2)
## [1] "Tiramisu"
You can wait for the user to type some text or numbers, and then use that input in further commands.
For example, these 2 lines of code ask you what you want to eat, and then display your text input:
text <- readline(prompt = "Hello! What do you want to eat today?: ")
print(text)
The input could also go straight into a function.
Here, answer y for yes and n for no to the question “Do you like R?”. What answer did you get?
R_question <- function() {
text <- readline(prompt = "Do you like R? (y/n): ")
if (text == "y") {
return("Yeah you rule!")
} else if (text == "n") {
return("Baaaaah!!")
}
}
# run the function
R_question()
Of course we could write the R_question() function in a script, save it as R_question.R, tell R where it is (using source()), and then use it in future R sessions too, without needing to recreate the function first.
The tidyverse is a group of R packages that share the same “grammar” and philosophy. The tidyverse also includes ggplot2, which we have seen earlier for plotting, and dplyr, which is similar to plyr (the package whose function ddply() we used earlier to quickly get the means by subject and/or condition). You can install the tidyverse packages with install.packages("tidyverse").
Pipes are a type of syntax that is part of the tidyverse coding style, created by Hadley Wickham. The advantage of pipes is that they make code much easier to read. In conventional R coding you nest functions inside functions, and therefore have to read “from the inside out”; pipes allow you to write the code in the order of the operations, which reads more like English and is therefore much easier to understand.
One neat example is given in this lecture by Hadley Wickham, around minute 21.
Imagine a little bunny called foo_foo, who hops through the forest, scoops up a field mouse, and bops it on the head…
foo_foo <- little_bunny()
# without the pipe (conventional writing)
bop_on(scoop_up(hop_through(foo_foo, forest), field_mouse), head)
# with the pipe
foo_foo %>% hop_through(forest) %>% scoop_up(field_mouse) %>% bop_on(head)
ATTENTION: you might want to use the keyboard shortcut for the pipe, which is Ctrl+Shift+M on Windows (Cmd+Shift+M on Mac)
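Here is a runnable sketch of a pipe, using dplyr (part of the tidyverse) to get the mean mpg per number of cylinders in mtcars:

```r
library(dplyr)
# reads top to bottom: take mtcars, group it by cyl, then summarise each group
mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg))
```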
For help on which statistical test to use, and how to implement it in R (or SAS, Stata, or SPSS), have a look at this link. But let’s start with descriptive statistics.
# on average, how many miles per gallon do the cars in the mtcars dataset drive?
mean(mtcars$mpg)
## [1] 20.09062
# what is the maximum of miles you can do with 1 gallon?
max(mtcars$mpg)
## [1] 33.9
# and the minimum?
min(mtcars$mpg)
## [1] 10.4
# how many car models have 4, 6, or 8 cylinders?
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
With the function ddply() from the plyr package you can easily obtain a lot of descriptive statistics, grouped in a neat and convenient way. For example, we could want to know how many cars with different numbers of cylinders exist in the mtcars dataset, what the percentage of cars with 4, 6, or 8 cylinders is, and what their minimum, maximum, and average miles per gallon value is.
means <- ddply(mtcars, c("cyl"), summarise, N_cars = length(cyl), percentage_cars = round((100/nrow(mtcars) * N_cars)), mpg_min = min(mpg),
mpg_max = max(mpg), mpg_mean = round(mean(mpg)))
means
## cyl N_cars percentage_cars mpg_min mpg_max mpg_mean
## 1 4 11 34 21.4 33.9 27
## 2 6 7 22 17.8 21.4 20
## 3 8 14 44 10.4 19.2 15
The chi-squared test compares an observed distribution to an expected one. For example, check whether the number of female participants differs between 2 or more groups.
nums <- c(42, 44, 40) # number of female participants in 3 groups
res <- chisq.test(nums, p = c(1/3, 1/3, 1/3)) # we expect the number of females to be equal across groups
res$expected # what we expect
## [1] 42 42 42
res
##
## Chi-squared test for given probabilities
##
## data: nums
## X-squared = 0.19048, df = 2, p-value = 0.9092
A correlation measures the strength and the direction of an association between 2 variables.
The strength varies between -1 and +1, where ±1 indicates a perfect association and 0 no association at all. The direction can be positive (+) or negative (-).
The most common correlation measures are Pearson’s and Spearman’s.
For the Pearson’s r correlation, both variables should be normally distributed.
Spearman’s rho instead is a non-parametric rank correlation (data is ranked), which does NOT require the data to be normally distributed.
r and rho are also effect sizes (they should therefore be written in italics).
According to Cohen, Pearsons’s r correlation coefficients can be roughly classified like this (the same categorization applies to Spearman’s rho):
The square of r (r²) indicates how much variance is shared by the 2 correlated variables
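For example, for mpg and hp in mtcars (a small sketch):

```r
r <- cor(mtcars$mpg, mtcars$hp) # Pearson's r
r   # about -0.78: a strong negative correlation
r^2 # about 0.60: mpg and hp share roughly 60% of their variance
```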
To correlate several variables with each other, use cor() from the stats package
# here we correlate columns 4 to 8 of the mtcars dataset with each other
cor(mtcars[, 4:8], use = "complete.obs", method = "spearman")
## hp drat wt qsec vs
## hp 1.0000000 -0.52012499 0.7746767 -0.66660602 -0.7515934
## drat -0.5201250 1.00000000 -0.7503904 0.09186863 0.4474575
## wt 0.7746767 -0.75039041 1.0000000 -0.22540120 -0.5870162
## qsec -0.6666060 0.09186863 -0.2254012 1.00000000 0.7915715
## vs -0.7515934 0.44745745 -0.5870162 0.79157148 1.0000000
# with 'complete.obs' missing values are handled by casewise deletion
To see the p value use cor.test() (but it only works for 1 pair of variables at a time)
cor.test(mtcars[, 4], mtcars[, 5], use = "complete.obs", method = "spearman")
##
## Spearman's rank correlation rho
##
## data: mtcars[, 4] and mtcars[, 5]
## S = 8293.8, p-value = 0.002278
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.520125
Even better is the function rcorr() from the Hmisc package: it gives correlation coefficients and significance levels together.
library(Hmisc)
rcorr(as.matrix(mtcars[, 4:8]), type = "spearman") # Input must be a matrix!
## hp drat wt qsec vs
## hp 1.00 -0.52 0.77 -0.67 -0.75
## drat -0.52 1.00 -0.75 0.09 0.45
## wt 0.77 -0.75 1.00 -0.23 -0.59
## qsec -0.67 0.09 -0.23 1.00 0.79
## vs -0.75 0.45 -0.59 0.79 1.00
##
## n= 32
##
##
## P
## hp drat wt qsec vs
## hp 0.0023 0.0000 0.0000 0.0000
## drat 0.0023 0.0000 0.6170 0.0102
## wt 0.0000 0.0000 0.2148 0.0004
## qsec 0.0000 0.6170 0.2148 0.0000
## vs 0.0000 0.0102 0.0004 0.0000
If you want to quickly convey to the reader the strength of correlations between several variables, you should use plots.
We have discussed the scatterplot before. But a fancier, and sometimes more useful, way of showing correlation strengths is a corrplot.
Use the function corrplot() from the package of the same name. The size and color of the circles indicate the strength and direction of the correlation. Check out the many options.
# install.packages('corrplot')
library(corrplot)
mycorrs <- cor(mtcars[, 4:8], use = "complete.obs", method = "spearman")
corrplot(mycorrs)
At other times, so-called heat maps are useful to quickly spot clusters of associations amongst many variables.
For a famous example, check out the paper “Matching Categorical Object Representations in Inferior Temporal Cortex of Man and Monkey”, published in Neuron by Kriegeskorte and colleagues (2008)
Now let’s do our own (slightly less fancy) heat map with the mtcars dataset
mycorrs <- cor(mtcars[, 4:8], use = "complete.obs", method = "spearman")
heatmap(mycorrs)
The t-test allows us to test whether a mean differs from a given value (one-sample t-test), or whether 2 means differ from each other (independent or paired two-sample t-test).
It is important to remember, that with a t-test you can only compare a max of 2 groups, or 2 conditions. Use ANOVA or linear mixed models if you have more groups and/or conditions.
By default, the function t.test() assumes unequal variances and applies the Welch degrees of freedom (df) modification. To specify equal variances, use the var.equal = TRUE option.
Also by default, a 2-tailed test is performed. Specify alternative="less" or alternative="greater" to run a 1-tailed test.
For example, we could test if the average horse power of all cars in the dataset mtcars differs from 0 (which it does), and from 150 (which it doesn’t)
t.test(mtcars$hp, mu = 0)
##
## One Sample t-test
##
## data: mtcars$hp
## t = 12.103, df = 31, p-value = 2.794e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 121.9679 171.4071
## sample estimates:
## mean of x
## 146.6875
t.test(mtcars$hp, mu = 150)
##
## One Sample t-test
##
## data: mtcars$hp
## t = -0.2733, df = 31, p-value = 0.7864
## alternative hypothesis: true mean is not equal to 150
## 95 percent confidence interval:
## 121.9679 171.4071
## sample estimates:
## mean of x
## 146.6875
Attention: the specification of the formula differs depending on whether your data is in long or wide format
In long format we specify a numeric variable and a binary factor (separated by “~”):
# select only chicks that followed diets 1 or 2 (we can only compare 2 groups with a t-test):
ChickWeight_temp <- ChickWeight[ChickWeight$Diet == 1 | ChickWeight$Diet == 2, ]
# average across chicken by time and diet
ChickWeight_temp_long <- aggregate(ChickWeight_temp, by = list(ChickWeight_temp$Time, ChickWeight_temp$Diet), FUN = mean,
na.rm = TRUE)
# make a bit nicer
ChickWeight_temp_long <- ChickWeight_temp_long[, c(1:3)]
colnames(ChickWeight_temp_long)[1:2] <- c("Time", "Diet")
# now test if weight differs between diet 1 and 2
t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet))
##
## Welch Two Sample t-test
##
## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 20.988, p-value = 0.4628
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -63.98631 30.13885
## sample estimates:
## mean in group 1 mean in group 2
## 105.6929 122.6167
Assume equal variances (this changes the dfs, and slightly the p value):
t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE)
##
## Two Sample t-test
##
## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 22, p-value = 0.4625
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -63.85473 30.00727
## sample estimates:
## mean in group 1 mean in group 2
## 105.6929 122.6167
Change to 1-tailed t-test (modifies the p value)
t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE, alternative = "less")
##
## Two Sample t-test
##
## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 22, p-value = 0.2312
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 21.93463
## sample estimates:
## mean in group 1 mean in group 2
## 105.6929 122.6167
t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE, alternative = "greater")
##
## Two Sample t-test
##
## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet)
## t = -0.74786, df = 22, p-value = 0.7688
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -55.78209 Inf
## sample estimates:
## mean in group 1 mean in group 2
## 105.6929 122.6167
Now let’s put the same data in wide format, to compare 2 numeric factors (separated by “,”):
library(reshape2)
ChickWeight_temp_wide <- dcast(ChickWeight_temp_long, Time ~ Diet, value.var = "weight")
colnames(ChickWeight_temp_wide)[2:3] <- c("Diet1", "Diet2")
t.test(ChickWeight_temp_wide$Diet1, ChickWeight_temp_wide$Diet2)
##
## Welch Two Sample t-test
##
## data: ChickWeight_temp_wide$Diet1 and ChickWeight_temp_wide$Diet2
## t = -0.74786, df = 20.988, p-value = 0.4628
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -63.98631 30.13885
## sample estimates:
## mean of x mean of y
## 105.6929 122.6167
It is important to point out that you get the same t value, p value, and dfs whether the data is in long or wide format!
We could use the same ChickWeight data, but this time compare the same chicks across 2 time points.
IMPORTANT: add the option paired = TRUE
In long format we specify a numeric and a binary factor (separated by “~”):
# select only Time 1 and 2:
ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2, ]
t.test(ChickWeight_temp$weight ~ as.factor(ChickWeight_temp$Time), paired = TRUE)
##
## Paired t-test
##
## data: ChickWeight_temp$weight by as.factor(ChickWeight_temp$Time)
## t = -15.907, df = 49, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.190877 -7.129123
## sample estimates:
## mean of the differences
## -8.16
In the wide format, we specify 2 numeric factors (separated by “,”)
library(reshape2)
ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2, ]
ChickWeight_temp_wide <- dcast(ChickWeight_temp, Chick + Diet ~ Time, value.var = "weight")
colnames(ChickWeight_temp_wide)[3:4] <- c("time0", "time2")
t.test(ChickWeight_temp_wide$time0, ChickWeight_temp_wide$time2, paired = TRUE)
##
## Paired t-test
##
## data: ChickWeight_temp_wide$time0 and ChickWeight_temp_wide$time2
## t = -15.907, df = 49, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.190877 -7.129123
## sample estimates:
## mean of the differences
## -8.16
One of the assumptions of the t-test is that the data are normally distributed. However, even when the data are NOT normally distributed, one can use the t-test as long as the sample size is reasonably large: N at least >10, better >100. Thus, the t-test is quite robust to violations of the normality assumption, as long as the sample is not too small.
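If you want to check the normality assumption, one common option (an addition, not discussed above) is the Shapiro-Wilk test from the stats package, where a significant p value indicates a deviation from normality:

```r
shapiro.test(mtcars$mpg) # if p > .05, there is no evidence against normality
```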
Alternatives to the t-test exist that do not require normally distributed data:
The non-parametric Mann-Whitney U test (also called Mann-Whitney-Wilcoxon, Wilcoxon rank-sum, or Wilcoxon-Mann-Whitney test) can be used as an alternative to the independent t-test.
It is calculated in 2 different ways, depending on the sample size (<20 or >20). In any case, it involves ranking the values of both groups, and does not assume normally distributed data.
The Wilcoxon signed ranks test can be used as an alternative to the dependent t-test
Implementation of these nonparametric tests is easy, and the formula almost identical to the t-test one.
Here an example of the Mann-Whitney U test:
# select again only chicks that followed diets 1 or 2 (we can only compare 2 groups with t-test or Mann-Whitney):
ChickWeight_temp <- ChickWeight[ChickWeight$Diet == 1 | ChickWeight$Diet == 2, ]
# compare weight between the 2 diets (note: ChickWeight_temp[, 1] and [, 2] would compare the weight and Time columns, which is not what we want)
wilcox.test(ChickWeight_temp$weight ~ ChickWeight_temp$Diet)
And here an example of the Wilcoxon signed ranks test:
# select only Time 1 and 2 (applied to all chicken):
ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2, ]
# put Time in wide format
library(reshape2)
ChickWeight_temp_wide <- dcast(ChickWeight_temp, Chick + Diet ~ Time, value.var = "weight")
colnames(ChickWeight_temp_wide)[3:4] <- c("time0", "time2")
# perform statistics:
wilcox.test(ChickWeight_temp_wide$time0, ChickWeight_temp_wide$time2, paired = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: ChickWeight_temp_wide$time0 and ChickWeight_temp_wide$time2
## V = 8, p-value = 1.132e-09
## alternative hypothesis: true location shift is not equal to 0
A simple linear regression is similar to a correlation: you want to know the link between two variables. In a regression, however, you go further than in a correlation, as you want to know how much y depends on (is predicted by) x.
A regression therefore tries to predict changes in the dependent variable based on 1 or more independent variables (predictors), even for a range of the IV that has not been observed.
For example, the mtcars dataset includes cars that have between 52 and 335 horse power. By fitting a regression model with miles per gallon (mpg) as DV, and horse power (hp) as IV, we can try to predict what the mpg would be for a car with 400 hp.
The general formula for a linear regression is lm(DV ~ IV, yourdata), where DV is the dependent variable, and IV the independent one. Get the result using summary(), or anova().
Let’s test if the horsepower of cars predicts their gas consumption (in mpg):
model1 <- lm(mpg ~ hp, mtcars)
summary(model1)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
How to read the output:
R-squared informs us about the amount of variance of mpg that is explained by the model (in this case 60%). Since a simple regression only includes 1 IV, R-squared also represents how much variance is explained by hp. If you take the square root of R-squared, you get the absolute value of the Pearson correlation coefficient r: sqrt(0.6024) = 0.78 (here the correlation is in fact negative). Thus, there is a strong correlation between mpg and hp.
The Adjusted R-squared indicates how well the model generalizes to the population. This value should be close to R-squared, but is typically a bit smaller.
The significant F-statistic in the last line tells us that the model predicts the DV well. Since the model only includes 1 IV, the F-statistic in this case also tells us that hp is a good predictor of mpg.
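These values can also be extracted from the summary object directly (a sketch using model1 from above):

```r
model1 <- lm(mpg ~ hp, mtcars)
summary(model1)$r.squared     # R-squared (0.60)
summary(model1)$adj.r.squared # adjusted R-squared (0.59)
summary(model1)$fstatistic    # F value together with its dfs
```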
The most important part of the output is the Coefficients table:
Let’s verify these things by plotting the scatterplot with the fitted regression line. To have the line extend all the way from 0 to 400 hp, we need to include xlim(0,400) and the fullrange=TRUE option.
library(ggplot2)
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + xlim(0, 400) + stat_smooth(method = "lm", fullrange = TRUE)
In the summary of the regression model, the 3rd line from the bottom provides the degrees of freedom (30).
The statistics can be reported like this: hp significantly predicted mpg, as confirmed by linear regression (t(30) = -6.742, p < .001, R-squared = .6).
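The prediction for a hypothetical 400 hp car mentioned above can be obtained with predict() (a sketch using model1):

```r
model1 <- lm(mpg ~ hp, mtcars)
# predict mpg for a car with 400 hp (beyond the observed range of 52-335 hp)
predict(model1, newdata = data.frame(hp = 400))
# about 2.8 mpg, i.e. 30.09886 - 0.06823 * 400 (the coefficients from the summary above)
```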
Same as simple regression, except that you predict changes in the DV based on several IVs.
A more complicated model, which includes several predictors, could help explain more variance, and thus achieve a greater R-squared.
You can compare the fit (R-squared) of different regression models using the function anova().
For example, do we explain more variance by also including the car’s cylinders?
model1 <- lm(mpg ~ hp, mtcars)
model2 <- lm(mpg ~ hp + cyl, mtcars)
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: mpg ~ hp
## Model 2: mpg ~ hp + cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 447.67
## 2 29 291.97 1 155.7 15.465 0.0004804 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes we do. The comparison of the 2 models showed that model2 fits better to the data. The R-squared of the 2nd model (the more complex one that includes 2 predictors) is 0.74, and thus significantly bigger than the R-squared of the 1st model (0.60).
c("mod1: ", round(summary(model1)$r.squared, digits = 2))
## [1] "mod1: " "0.6"
c("mod2: ", round(summary(model2)$r.squared, digits = 2))
## [1] "mod2: " "0.74"
By using multiple instead of simple regression, we were able to explain 14% more variance!
Which of the 2 IVs is more important in mod2? To answer that question, we can look at their estimates, SEs, and t-values:
summary(model2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.9083305 2.19079864 16.846975 1.620660e-16
## hp -0.0191217 0.01500073 -1.274718 2.125285e-01
## cyl -2.2646936 0.57588924 -3.932516 4.803752e-04
Only cyl significantly predicts mpg (as indicated by its p value being < .05). Therefore, the number of cylinders of a car has a bigger effect on gas consumption than horse power.
Another way to look at this is by using the standardized estimates. For that, we use the function lm.beta() from the package QuantPsyc
library(QuantPsyc)
lm.beta(model2)
## hp cyl
## -0.2175294 -0.6710802
The standardized beta estimates indicate that gas consumption increases (mpg decreases) by 0.67 standard deviations (SDs) when the number of cylinders increases by 1 SD. The same change in hp only leads to a 0.22 SD change in mpg.
Similarly, the 95% confidence interval (CI) of the unstandardized beta estimate for cyl does not cross 0, whereas the CI for hp does. This indicates that, of the 2 predictors, the number of cylinders is the more reliable one (its slope is always negative) and plays a bigger role.
confint(model2)
## 2.5 % 97.5 %
## (Intercept) 32.42764417 41.38901679
## hp -0.04980163 0.01155824
## cyl -3.44251935 -1.08686785
We can also represent the size of each estimate, and the CI around it, using the function plot_model() of the package sjPlot; its argument sort.est = TRUE sorts the estimates, putting the largest effect at the top.
library(sjPlot)
plot_model(model2, sort.est = TRUE)
One could also add more IVs to the model. However, adding displacement (disp) does not significantly improve model fit:
model2 <- lm(mpg ~ hp + cyl, mtcars)
model3 <- lm(mpg ~ hp + cyl + disp, mtcars)
anova(model2, model3)
## Analysis of Variance Table
##
## Model 1: mpg ~ hp + cyl
## Model 2: mpg ~ hp + cyl + disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 291.98
## 2 28 261.37 1 30.605 3.2787 0.08093 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding weight (wt) does improve the model significantly:
model2 <- lm(mpg ~ hp + cyl, mtcars)
model4 <- lm(mpg ~ hp + cyl + wt, mtcars)
anova(model2, model4)
## Analysis of Variance Table
##
## Model 1: mpg ~ hp + cyl
## Model 2: mpg ~ hp + cyl + wt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 291.98
## 2 28 176.62 1 115.35 18.287 0.0001995 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Indeed, the R-squared is now 0.84. But notice how adding weight to the regression turns hp and cyl into non-significant predictors (their p values are > .05). This happens because the IVs included in the model are not independent of each other: the estimate (slope) of each IV describes its effect while the other IVs are held constant. Thus, once weight and horsepower are held constant, the number of cylinders no longer plays a significant role!
summary(model4)
##
## Call:
## lm(formula = mpg ~ hp + cyl + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## hp -0.01804 0.01188 -1.519 0.140015
## cyl -0.94162 0.55092 -1.709 0.098480 .
## wt -3.16697 0.74058 -4.276 0.000199 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
library(sjPlot)
plot_model(model4)
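The overlap between the IVs can be quantified with variance inflation factors (VIFs): for each IV, regress it on the other IVs and compute 1 / (1 - R-squared). Values near 1 indicate independent predictors; larger values (common rules of thumb flag 5 or 10) indicate strong collinearity. The car package offers this as vif(); here is the same computation sketched in base R:

```r
# VIF of an IV = 1 / (1 - R^2) from regressing it on the other IVs
vif_hp  <- 1 / (1 - summary(lm(hp  ~ cyl + wt,  mtcars))$r.squared)
vif_cyl <- 1 / (1 - summary(lm(cyl ~ hp  + wt,  mtcars))$r.squared)
vif_wt  <- 1 / (1 - summary(lm(wt  ~ hp  + cyl, mtcars))$r.squared)

round(c(hp = vif_hp, cyl = vif_cyl, wt = vif_wt), digits = 2)
```

All three values come out well above 1, confirming that hp, cyl, and wt are substantially correlated with each other.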
The Analysis of Variance (ANOVA) is a special case of a linear regression, or linear model, in which the predictor variable(s) are categorical (instead of continuous).
For example, you could have the between-subjects factor Treatment (meaning each subject gets only one level), with 3 levels: drug A, drug B, and placebo. And/or the within-subjects factor Condition (meaning every subject does all levels), with 3 levels: attention to faces, attention to houses, and a control condition.
That linear regression and ANOVA are closely related becomes evident from the fact that you can also look at the results of a linear regression using the anova() function.
Let’s again regress mpg on hp, using the mtcars data.
summary(lm(mpg ~ hp, mtcars))
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
Notice that the last line gives the F-statistic. Indeed, you could also look at the results of the linear regression using the anova() function:
anova(lm(mpg ~ hp, mtcars))
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## hp 1 678.37 678.37 45.46 1.788e-07 ***
## Residuals 30 447.67 14.92
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Another way to run an ANOVA and get a results table similar to the one in SPSS is to use the aov() function.
The first argument is a formula: the dependent variable, followed by the tilde symbol (~) and the independent variable(s). Finally, you pass the dataset to be used. You store the analysis in an object and then look at the results using summary().
It is also possible to get the means and number of subjects per cell using print(model.tables()); this only makes sense with a categorical predictor.
Let's test again whether mpg differs by hp:
model2 <- aov(mpg ~ hp, mtcars)
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## hp 1 678.4 678.4 45.46 1.79e-07 ***
## Residuals 30 447.7 14.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And to use a categorical predictor, let's see whether mpg differs by the number of cylinders (which it does):
# first convert cyl from numeric to factor
mtcars_temp <- mtcars
mtcars_temp$cyl <- as.factor(mtcars_temp$cyl)
# then run the anova
model3 <- aov(mpg ~ cyl, mtcars_temp)
summary(model3)
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 2 824.8 412.4 39.7 4.98e-09 ***
## Residuals 29 301.3 10.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A quick way to get the means and number of cars per cylinder-group is to use model.tables():
print(model.tables(model3, "means"), digits = 3)
## Tables of means
## Grand mean
##
## 20.09062
##
## cyl
## 4 6 8
## 26.7 19.7 15.1
## rep 11.0 7.0 14.0
Do we get the same means with ddply? Yes we do!
ddply(mtcars, c("cyl"), summarise, N_cars = length(cyl), mpg_mean = round(mean(mpg), digits = 1))
## cyl N_cars mpg_mean
## 1 4 11 26.7
## 2 6 7 19.7
## 3 8 14 15.1
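The same summary table can also be produced with the more recent dplyr package (shown as an alternative sketch to ddply):

```r
library(dplyr)

# group mtcars by cylinder number, then count cars and average mpg per group
mtcars %>%
  group_by(cyl) %>%
  summarise(N_cars = n(), mpg_mean = round(mean(mpg), digits = 1))
```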
To display boxplots of mpg by cylinder-group:
boxplot(mpg ~ cyl, data = mtcars_temp)
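The same boxplots can also be drawn with ggplot2; a minimal sketch (mtcars_temp is recreated so the chunk is self-contained):

```r
library(ggplot2)

# cyl must be a factor to get one box per cylinder group
mtcars_temp <- mtcars
mtcars_temp$cyl <- as.factor(mtcars_temp$cyl)

ggplot(mtcars_temp, aes(x = cyl, y = mpg)) +
  geom_boxplot()
```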
See this introduction to repeated measures Analysis of Variance (rmANOVA) by DataCamp.
There are different ways of computing an rmANOVA. For example, in his book Discovering Statistics Using R Andy Field lists 4 ways.
Of these, the ezANOVA() function from the ez package was (as of November 2018) not available for the latest R version (3.5.1).
I thus recommend using aov(), or a linear mixed model (LMM) with the function lmer() from the package lme4.
Let's start by using aov():
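A minimal sketch with simulated data (the variable names subject, condition, and rt, and the dataset itself, are made up for illustration). The key element is the Error(subject/condition) term, which tells aov() that condition varies within subjects:

```r
# simulate long-format data: 20 subjects x 3 within-subject conditions
set.seed(1)
d <- data.frame(
  subject   = factor(rep(1:20, each = 3)),
  condition = factor(rep(c("faces", "houses", "control"), times = 20)),
  rt        = 500 + rep(c(0, 20, 40), times = 20) + rnorm(60, sd = 50)
)

# repeated-measures ANOVA: Error() defines the within-subject error strata
rm_model <- aov(rt ~ condition + Error(subject / condition), data = d)
summary(rm_model)
```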
Comment generously
I strongly advise putting plenty of comments into your R code. This will make it easier to understand for colleagues with whom you might share it, but also for you when you come back to it after some months or years.
In R, everything on a line after a hash sign (#) is interpreted as a comment, and therefore not executed as code.
In contrast to some other languages (e.g. Matlab, with its %{ ... %} block-comment syntax), R has no way to write comments that span several lines. To comment out several lines, write # at the beginning of each one.
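For example:

```r
# This whole line is a comment
x <- 2 + 2  # a comment can also follow code on the same line

# A "multi-line" comment is simply several
# consecutive lines, each starting with its own #
```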